Some extras about the term necessity work

October 30, 2010

There are several important things I didn’t get to talk about at CIKM 2010.  The talk focused on a technical understanding of the term necessity concept and its role in term weighting.  Below I list future work the IR field can do, and some related work I didn’t cover.

I’ll also tell a bit of the background story/history of how we arrived at the term necessity work from our previous structured retrieval work (Michael Bendersky asked me this question at CIKM, so at least there is some external interest).

1. There is a lot more future work possible than any one group of people can finish in one project, so I’d like to see more people working on necessity prediction and applying necessity to retrieval in different ways.  After more than 30 years of little progress, we now understand that term necessity/recall is an important part of term weighting (the other part being idf).  Since necessity lies at the core of the retrieval model, many kinds of intervention in retrieval are possible: as term weighting for original query terms, for expansion terms, and/or for bigrams; as a feature to guide expansion toward the terms that tend to mismatch relevant documents; or in totally new approaches.

The important sanity check is to make sure there are in fact many mismatch cases between query terms and relevant documents in your dataset; otherwise, mismatch prediction won’t help much.  High-mismatch scenarios include, e.g., long queries, short documents, or complex query terms such as phrases or field restrictions.
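To make the check concrete, here is a minimal sketch in Python, assuming you have relevance judgments and tokenized documents at hand; the function name and data layout are hypothetical:

```python
# Dataset-level mismatch sanity check (hypothetical data layout):
# a query term's mismatch rate is the fraction of its judged-relevant
# documents that do NOT contain the term; we average over all terms.

def avg_mismatch_rate(queries, qrels, docs):
    """queries: {qid: [terms]}; qrels: {qid: [relevant doc ids]};
    docs: {doc id: set of tokens}."""
    rates = []
    for qid, terms in queries.items():
        rel = [docs[d] for d in qrels.get(qid, []) if d in docs]
        if not rel:
            continue
        for t in terms:
            rates.append(sum(1 for d in rel if t not in d) / len(rel))
    return sum(rates) / len(rates) if rates else 0.0
```

If this comes out near zero for your dataset, there is little mismatch to fix, and necessity prediction has little room to help.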

2. Besides the ones listed in my slides, Regression Rank (Matt Lease) and term importance (Bendersky) are a few must-reads.  I’ll have a fairly complete review of related work in my thesis proposal this December; if you want a rough draft, email me, or look at the WikiQuery website for a list of queries I used in doing the survey.  You can follow the links on the pages to find results from Google or Yahoo.

3. How did I end up in this work from structured retrieval?

A short version of the story:

Working on sentence retrieval (short documents) and trying to find a good form of structured query led us to conjunctive normal form Boolean queries, which basically try to improve term recall.  Teaching the probabilistic models in the IR class led us to revisit the binary independence model, which led to the understanding that P(t|R) is term necessity (how necessary it is for a term to occur in a document for the document to be relevant), which is also just term recall (the percentage of relevant documents matching the term).  Connecting the two gave us the experiments and the paper.
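Since term recall has such a simple frequentist reading, it is worth writing down; a minimal sketch, again with a hypothetical data layout:

```python
# Term necessity / term recall P(t|R): the fraction of judged-relevant
# documents that contain the term (hypothetical data layout).

def term_recall(term, relevant_docs):
    """relevant_docs: list of sets of tokens, one per relevant document."""
    if not relevant_docs:
        return 0.0
    return sum(1 for doc in relevant_docs if term in doc) / len(relevant_docs)
```

Necessity and mismatch are two views of the same quantity: the mismatch rate of a term is simply 1 - term_recall(term).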

A longer version of the same story:

The initial influence came from the TREC Legal track.  Yangbo and I participated in the second year of the Legal track, and in both years, to the disappointment of all participants, no keyword ranking approach could beat the lawyers’ Boolean (roughly conjunctive normal form, CNF) structured queries.  We were simply frustrated, without much understanding of the reasons behind it.  I remember Doug Oard (one of the organizers of the track) even suggested taking the top keyword-ranked documents that don’t match the Boolean filter and using them to replace the bottom-ranked documents that do match the filter.  Whether the idea worked or not can be found in a later year’s TREC report from Doug’s group, but that’s not so important for the current story.  The important thing is that we were all frustrated, and didn’t realize that the lawyers were actually using queries superior to all of our retrieval models/techniques.

The next piece of work was the CIKM 2009 poster, though most of the experiments for it happened much earlier, in summer 2008.  We automatically created conjunctive normal form (CNF) Boolean filters from the questions, to filter out top-ranked false-positive sentences for a question answering application.  For the question “When did Wilt Chamberlain score 100 points?”, the final Boolean filter is: (Wilt OR Chamberlain) AND (100 OR points) AND #any:date.  For the TREC Legal track we were not involved in creating the queries; the lawyers did that for us.  But now that we looked at the actual rank list returned by the keyword query, we realized that this kind of Boolean query is very effective at filtering out false positives caused by the keyword retrieval model’s bias toward high-idf terms.  For example, #any:date will match any “date” named-entity tag, which has extremely low idf, so many of the top sentences don’t even contain a date; but a date is necessary to answer the question.  The term “100” also tends to be de-emphasized by the retrieval model, while “Wilt”, “Chamberlain” and “score” are preferred.  Because the newswire collection contains lots of sentences about Wilt scoring many points in one season or one year, these irrelevant sentences get returned at top ranks.  After this work, CNF Boolean queries firmly held their position in my mind.
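To make the mechanics of such a filter concrete, here is a minimal sketch; representing a CNF query as a list of clauses and treating #any:date as a precomputed named-entity flag are my simplifying assumptions:

```python
# CNF Boolean filtering, sketched: a query is a list of clauses (sets of
# terms), and a sentence passes only if every clause matches.

def matches_cnf(sentence_tokens, clauses, has_date=False):
    """clauses: e.g. [{'wilt', 'chamberlain'}, {'100', 'points'}];
    has_date stands in for the #any:date named-entity operator."""
    tokens = set(sentence_tokens)
    for clause in clauses:
        if clause == {'#any:date'}:
            if not has_date:
                return False
        elif not (clause & tokens):
            return False
    return True
```

A top-ranked sentence that matches the high-idf terms but contains no date named entity is filtered out, no matter how highly the keyword model scores it.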

The next important link in the chain of reasoning actually came from teaching the probabilistic models in Jamie’s IR class in the spring semester of 2009 (using mostly Jamie’s slides).  When trying to explain the binary independence model, the probability P(t|R) showed up.  Greiff’s exploratory data analysis showed that this probability is difficult to estimate, so it drew my attention at the time, even if only subconsciously.  I named the probability term necessity, and when teaching I tried to explain it to the students; some of them understood it and accepted the name.
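For readers who have not seen it in a while, this is where P(t|R) sits in the binary independence model; a sketch of the standard RSJ term weight, with estimation and smoothing deliberately left out:

```python
import math

# The RSJ / binary independence model term weight: p = P(t|R) is the
# necessity part, q = P(t|non-R) ~ df/N is the idf-like part.

def rsj_weight(p, q):
    """Log-odds weight of a term; assumes 0 < p < 1 and 0 < q < 1."""
    return math.log((p * (1 - q)) / (q * (1 - p)))
```

If p is held constant for all terms, the weight collapses to an idf-like quantity; letting p vary per term is exactly where necessity enters term weighting.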

After summer 2008, I became more and more obsessed with CNF Boolean queries.  I discovered Marti Hearst’s (1996) and Mandar Mitra et al.’s (1998) interesting work on manually formulating CNF queries (Mandar even kindly gave me the Boolean filters from his PhD thesis, from about 10 years earlier!), and I discovered that librarians also use this kind of query.  At the beginning of summer 2009, without the TA burden, I set out to automatically build CNF Boolean structured queries for ad hoc retrieval.  The initial approach just naively grouped expansion terms into the original query concepts (the concept groups come from Mandar’s Boolean disjunctions); it was not successful (no better than the baseline Boolean-filtered keyword retrieval).  Halfway into the summer, we decided to make it smarter by introducing term necessity predictions, so that the algorithm would know which terms to expand.  That attempt was also not very successful, because most of the expansion terms were not related to the query concept at all.  The poor term similarity measure led me to finally use local SVD to calculate term similarity.  Luckily, adapting the SVD code in Lemur did not take much time.  Still, time flew by quickly, and per-term expansion seemed far away, so I asked Jamie to allow me one more week, just to try weighting the original query terms using the predicted necessity values.  I still clearly remember, on the way to the LTI picnic at the end of the summer, telling Ni Lao (my officemate and good friend) that I needed to show some progress within the week, or end the current attempt.
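For the curious, the local SVD idea can be sketched in a few lines; this is my own reconstruction, not the actual Lemur code, and it assumes you already have a term-document count matrix built from the top retrieved documents:

```python
import numpy as np

# Local SVD term similarity, sketched: factor the term-document matrix
# of the top retrieved documents and compare terms by cosine similarity
# in the reduced space.

def local_svd_similarities(term_doc, k=50):
    """term_doc: terms x docs count matrix (numpy array).
    Returns a terms x terms cosine similarity matrix."""
    U, S, _ = np.linalg.svd(term_doc, full_matrices=False)
    k = min(k, len(S))
    vecs = U[:, :k] * S[:k]                      # one row per term
    norms = np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-12)
    vecs = vecs / norms
    return vecs @ vecs.T
```

The point of doing the SVD locally, on the top retrieved documents rather than the whole collection, is that the term co-occurrence statistics are then conditioned on the query’s topic.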

That simple experiment did show a lot of promise, with real (i.e., statistically significant) performance gains over the baseline language model on TREC description (long) queries.
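Concretely, the experiment amounts to replacing uniform query term weights with the predicted necessities.  Here is a sketch of what that looks like as an Indri/Lemur #weight query; the weights below are made-up illustrations, not real predictions:

```python
# Turn predicted necessity values into an Indri-style #weight query
# (the example terms and weights are made up for illustration).

def necessity_weighted_query(term_weights):
    """term_weights: {term: predicted necessity in (0, 1]}."""
    parts = ' '.join(f'{w:.2f} {t}' for t, w in term_weights.items())
    return f'#weight( {parts} )'

print(necessity_weighted_query(
    {'prevention': 0.90, 'measures': 0.30, 'radioactive': 0.80}))
# -> #weight( 0.90 prevention 0.30 measures 0.80 radioactive )
```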

Sometimes, when I look back, I think that even if it had not been successful, I would still have asked for more time to analyze the errors made by the model and tried more things to make it work.  I am quite convinced that such CNF queries and term necessity will work, given that the true necessity weights can give 30-80% gains in MAP.  One thing is quite certain, though: I was too ambitious at that time.  Now I am proposing with the term weighting results, and planning to work on automatic CNF queries as thesis work.

After some good results, the rest was straightforward, though not always easy.  We spent the whole of September to December 2009 writing the work up.  Making the relationship between necessity and term weighting clear was easy, but telling a convincing story was not.  Justifying the features was a bit tricky but worthwhile.  Including more experiments on more datasets was quite easy.

Another significant event during the experimentation and writing period was a brief chat with Yiming Yang at an LTI event.  I explained the CNF queries to her, and Yiming immediately said: oh, you are improving recall.  This turned up in the SIGIR submission, and more significantly, it made it easy to make the connection between necessity and mismatch.

Term mismatch was in the title of a proposal Jamie and I wrote.

Following that came the SIGIR submission and rejection: the story was convincing, but one reviewer picked on a relatively low baseline run.  OK, fixed.  And following that, the CIKM submission and acceptance.  One reviewer from CIKM was extremely thorough: 8 main points and 16 small points about the work.  I took one week of vacation from my summer internship to do more experiments addressing the feedback from all the reviewers, which finally made the paper a full 10 pages long.  Sometimes I wonder how such a simple idea could need 10 pages to explain.

After that, the feedback on the proposal also came back.  On average, the proposal reviews were much more serious (and helpful) than the conference paper reviews.  It took me over a week to carefully read through the reviews, addressing each point in my thesis proposal.  One reviewer emphasized the RIA workshop.  I re-examined the report from the workshop, and also dug out an old 2007 email from Ni Lao, summarizing a discussion between us that happened back then.  The email clearly identified two main causes of failure for search engines: 1) the emphasis problem, and 2) the mismatch problem.  This became the nice-looking pie chart in my CIKM talk.  Ni is just amazing!  Unfortunately, the pie chart and the RIA workshop connection are not in the CIKM paper; they did not arrive in time for the camera-ready deadline.

With the understanding of term necessity/recall/mismatch, everything else fits together surprisingly nicely: from CNF Boolean queries to the mismatch rate, to necessity, to the emphasis problem.  Back in 2007 it was unclear what caused the emphasis problem; we briefly guessed it was related to term weighting, but were not sure what the optimal weights should be, until revisiting the BIM while teaching the IR class.  And even while creating the pie chart for the CIKM 2010 talk, it struck me how term necessity connects the emphasis problem and the mismatch problem.

In summary, I am usually surprisingly slow at understanding and generalization, of both empirical results and theoretical concepts.  As laid out above, every transition took me months.  But I guess that’s true for most humans; that’s why research progresses slowly.  Months of experimentation, writing, teaching and discussion yield only a couple of really important serendipitous moments, and usually you don’t realize the importance of those moments until much later.  The only way out is to keep doing everything (experiments, reading) and keep thinking (writing, talking) at the same time.

I am also not very good at lateral thinking, at making connections with existing concepts and papers.  This is relatively easy to solve: crystallizing the idea and looking at it from various angles will surface related concepts and approaches.  And talking to lots of people about the idea, plus searching for related work on the Web, complements the limited knowledge of a single person.

Finally, something practically useful: you have to try it yourself to see how conjunctive normal form queries increase both recall and top precision for your searches.  Here is a useful website for easily creating such CNF Boolean queries, storing them, searching for results, and even sharing the queries publicly so that other people may benefit from the queries you created.  I have used this WikiQuery website for the literature review of my thesis proposal, and I find myself constantly coming back to the site to look at existing queries (re-finding relevant documents, or finding new ones) and to create new queries as our understanding of a topic increases.



Evil Fortune Cookie Notes

March 29, 2010
To show how language can be tweaked in mysterious ways to fit a wide range of interpretations, I’d like to include a few notes from fortune cookies that I actually received.

Front: A new venture will be a success.
Back: Husband

Front: Sometimes you have to give up things to be happy. (paraphrase)
Back: Friends
Waiting for more notes to come...

Got a parking ticket

August 24, 2008
Got it while waiting for someone at the airport.  A round of googling revealed that whether or not anyone is in the car, as long as the car is stopped and is not actively picking up or dropping off people or goods, it counts as parking.
 
But either way, as part of a complete life experience, let me appeal it first!

Kid sites list (a fairly complete one)

January 28, 2008
This is a list of websites with a low reading difficulty level.  Not every one of them is very good, but there are many interesting ones.
A byproduct of research (the first number is a popularity score, indicating whether the site has many low-difficulty pages).  Boys and girls, enjoy!
 

Only when you work on it do you appreciate other people’s work

October 25, 2007
In the film, modern artists tried to recreate parts of Michelangelo’s work, like the David http://en.wikipedia.org/wiki/David_(Michelangelo) and the main scene of the fresco on the Sistine Chapel ceiling http://en.wikipedia.org/wiki/Sistine_Chapel_ceiling.  Everything turned out to be non-trivial: picking the perfect marble for the 5m-tall David, moving the marble downhill, and moving the David from the artist’s workplace to the center of the city.  The artist who painted on a ceiling for just two weeks suffered neck and back pains, while Michelangelo took four years of days and nights to finish the 500-square-meter ceiling.
In a word, without doing it, it’s difficult to have the empathy to appreciate the work.  It’s the same with research.  It’s easy to criticize other people’s work when you have no experience with it, but if you truly tried to find better ways of finishing similar jobs, you would be more careful with your criticisms.  Now, whether empathy is a good thing or a bad thing is not important.  What’s important is:
Make things work.  The more you do, and the more tasks you finish, the more easily you will appreciate other people’s work, because you understand better what really makes a solution work.  And because of that, the more likely you are to apply the spirit of other people’s work to solve your own tasks and finish more of them.
In reaching these conclusions, I may have taken an engineering point of view, but actually the same holds for theories as well.  The beauty of a theory is the solution it provides to a problem; it’s how well the theory fits the problem.

Bitter “chestnuts” on the hill behind campus

September 16, 2007
Recently I stumbled upon several “chestnut” trees on the hill behind campus.  In autumn the trees were loaded with nuts and the ground was covered with fallen ones.  I easily gathered a bagful and was quite pleased, deciding to air-dry them at home before eating.  The next day I invited a few friends to go picking together.  While picking, I tasted one: the raw “chestnut” was crunchy and had no aroma.  I took a small bite; at first it tasted bland, but after chewing for a few seconds a wave of bitterness suddenly hit.  Sensing trouble, I spat it out immediately, but it was too late: although I hadn’t swallowed, the residual taste on my tongue lingered on and on...  We reasoned that raw chestnuts were supposed to be somewhat astringent, so we ignored it, easily gathered a few more pounds of fresh nuts, and agreed to steam them before eating.
 
The next day, following a recipe recommended online, I roasted a pound and had a taste.  The aroma had improved, and the first taste was sweet, but after a few seconds the bitterness struck again, not diminished in the least...  Fortunately I was prepared this time, with cookies and cold water standing by.
 
Not knowing why they were so bitter, I googled it, with the following results:
The bitter “chestnut” is not a chestnut at all; it is the horse-chestnut.  Only the nut resembles a chestnut; it is a completely different plant, used only as livestock feed (cattle and sheep gain weight eating it; pigs refuse it).  See:
The chestnut people commonly eat is the sweet chestnut, or simply chestnut.  See:
 
Attached are two pictures to warn my friends:
the first is a real chestnut, the second is the bitter horse-chestnut.

Bought a 1994 Honda Accord

July 6, 2007
The engine is still strong thanks to the previous owner’s good maintenance, and the automatic transmission shifts smoothly.  Although this is the first Accord I have driven, and a rather old one, I am quite satisfied with its performance.
It already has 160,000 miles on it.  Drive it well, maintain it carefully, and aim for 500,000 miles to build an undefeated legend...
Driving test tomorrow; I have to pass!  Otherwise, having a car I can’t drive would be really frustrating.
Posting a few photos.  (The car’s surface has some minor blemishes awaiting cosmetic work, so no close-ups for now, hohoho)
Oh, and this car is the deluxe EX trim, with a fully opening sunroof.  Although I won’t be sticking my head out for no reason, it still feels great~