Is Search a Solved Problem?

September 1, 2012

“Is search a solved problem?”  This question came up at the industry panel of SIGIR 2012.  However, most of the panelists evaded the question in its strictest sense, pointing instead to new search problems or new relevance criteria, such as combining relevance with recency.  Strictly speaking, the question is whether full-text ad hoc retrieval is a solved problem.  Other ways of asking it include “What is the next big thing in retrieval modeling?”, “How can we consistently and substantially improve over the basic tf.idf retrieval models?”, “Why haven’t retrieval models improved over BM25 in almost 20 years?”, and “How can one improve over Google at basic full-text Web search?”

I believe my dissertation on the term mismatch problem provides a satisfactory answer to many of these questions.

The first half of this article comments on the questions and common misconceptions people have about retrieval modeling, and shows how they can be answered based on the new understanding of the term mismatch problem and its relationship to the tf.idf retrieval models.  The second half directly quotes paragraphs from the introduction chapter of my dissertation, since those paragraphs directly address the question “is search a solved problem”.  Interested readers can explore the full dissertation at

In IR research, there is a misconception about the limitations of the current retrieval models.  For example, in the Salton lecture of SIGIR 2012, Norbert Fuhr focused on the term independence assumption made by the Binary Independence Model (BIM).  However, what people often overlook is that another assumption is made when applying the BIM in ad hoc retrieval, e.g. in Okapi BM25: it is commonly assumed that the probability of a term t occurring in the relevant set of documents R, P(t|R), is 0.5.  This assumption is made because in ad hoc retrieval the relevant set of documents for a query is typically unknown beforehand.  It cripples the BIM by removing the relevance variable completely from the full model.  My dissertation shows evidence that in many cases it creates a much more significant problem than the term independence assumption does.  It also shows that this probability is directly related to the term mismatch problem, and that solving the term mismatch problem could improve retrieval by 50-300%.  This means the mismatch problem happens deep down at the retrieval model level, and that it is a significant problem with huge potential.
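To make the 0.5 assumption concrete, here is a minimal Python sketch (my illustration, not code from the dissertation) of the Robertson-Sparck Jones relevance weight that the BIM assigns each matching term.  Setting P(t|R) = 0.5 cancels the relevance factors and leaves an idf-like weight with no relevance information in it; the probabilities below are made-up values for illustration:

```python
import math

def rsj_weight(p_rel, p_nonrel):
    """Robertson-Sparck Jones relevance weight for one term.
    p_rel    = P(t|R):  probability the term occurs in a relevant document
    p_nonrel = P(t|NR): probability it occurs in a non-relevant document"""
    return math.log((p_rel * (1 - p_nonrel)) / (p_nonrel * (1 - p_rel)))

# The ad hoc assumption: with the relevant set unknown, set P(t|R) = 0.5.
# The weight then collapses to log((1 - p_nonrel) / p_nonrel), an
# idf-like quantity that ignores relevance entirely.
w_assumed = rsj_weight(0.5, 0.01)   # rare term, assumed P(t|R) = 0.5
w_actual  = rsj_weight(0.2, 0.01)   # same term, actual P(t|R) = 0.2
```

A term that in truth occurs in only 20% of the relevant documents (a high-mismatch term) deserves a much lower weight than the 0.5 assumption gives it, which is exactly the kind of error the dissertation targets.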

Because the mismatch problem is neglected, when asked where search can be improved, people often resort to something else.  For example, in the Salton lecture, Norbert Fuhr mentioned that IR is about vagueness, with which I agree, but then resorted to some sort of user interaction to clarify the information need, stepping outside the realm of retrieval modeling.  Even within the scope of retrieval modeling, even when the search request is clearly described, as in most TREC queries, search is still not a solved problem.  How can search be solved when we are only getting a search accuracy of around 0.2 to 0.5 in standard evaluations?  In the industry panel of SIGIR 2012, Diane Kelly, like Norbert Fuhr, also pointed to user interactions when answering the title question.  Krishna Gade from Twitter changed the relevance criterion to consider recency along with relevance, which is just another way of avoiding the question “is search a solved problem” in its original sense.  Jerome Pesenti mentioned some cases of enterprise search, such as helping the user complete a task that uses search as a component, which also steps outside search strictly defined.  I am not 100% sure whether it was Trystan Upstill who mentioned structured retrieval or semantic retrieval, but structured retrieval is more of a tool than a problem.  Furthermore, strict structured matching between queries and answer texts, e.g. phrases or the parsing structure of the query, worsens the mismatch problem.

Here, I am not trying to argue that retrieval modeling is the only problem in search.  These other directions are certainly valid and valuable, and I am aware that search is only part of a bigger picture of decision making or task completion.  However, the retrieval model is the most fundamental component underlying a search system, and it affects any system or task that uses search as a component.  Fundamental limitations of the retrieval models affect system performance, the behavior of the retrieval techniques that work on top of the retrieval model, and the behavior of the search users.  This is true everywhere search is executed.  Because of that, understanding of the fundamentals should always be pursued with more enthusiasm than anything peripheral.  Now that we know mismatch is an important problem in retrieval modeling, there is every reason to understand and solve it.

Of all the industry panelists, Steve Robertson’s point about searching small collections, such as desktop and some enterprise search, is perhaps the only one about the core retrieval task.  More generally, a small collection typically implies a case of recall-oriented search, where the emphasis is on finding all relevant documents in the collection.  In desktop search, there is usually only one document that the user wants.  In legal discovery, patent search or bio/medical retrieval, the cost of missing a relevant document can be very high.  Now that we know mismatch is a fundamental limitation of the retrieval models, it is perhaps not so surprising that current retrieval systems do not do well on these recall-oriented tasks.  Thus, underlying the small-collection search problem is the term mismatch problem.  If Bruce Croft were on the panel, he would probably have mentioned long queries, which are also term mismatch in disguise: the longer the query gets, the more likely the query terms will mismatch relevant documents, and the more likely retrieval accuracy will be hurt.

On large collections like the Web, is search solved?  Definitely not.  Searchers doing informational searches are still constantly frustrated by the major Web search engines (see [Feild, Allan and Jones 2010]), and mismatch is perhaps one of the major causes of that frustration.  Even for precision-oriented search, my dissertation shows that successfully solving the mismatch problem can substantially improve precision as well, so it is still valuable to solve mismatch.  Are there problems other than mismatch?  Yes, definitely.  Sometimes precision can also be a problem.  However, a query afflicted by mismatch may look like a precision problem to the untrained eye, and a wrong diagnosis may cost the searcher a lot of time trying to fix the query in futile ways.  Are there other problems?  Maybe, but they have to be carefully investigated, clearly defined and connected back to the fundamentals.

—— The following text is cited from my dissertation ——

Web search engines like Google have popularized the use of search technology to the extent that search has become part of our daily life. These Web search engines have also made search so easy to use that an ordinary search user would be satisfied with the results from the search engine most of the time. Some would even think that search is a solved problem.

This dissertation argues that search is far from being solved. As long as machines cannot understand human language as perfectly as humans do, search is an unsolved problem. Some may still disagree, thinking that although it’s probably true that machines still cannot perfectly understand human language, it is search that demonstrated that a natural language processing task can be tremendously successful even with very simple algorithms and easy-to-compute statistics (Singhal 2001). In fact, attempts to use more complex natural language processing techniques in retrieval have mostly proven futile. Even people working intimately with retrieval technology have the impression that research on basic retrieval algorithms is hitting a plateau. Some may even think that there is probably not much room for improvement. After all, the retrieval models of the 1990s and early 2000s (Okapi BM25 and statistical language models) are still the standard baselines to compare against, and the go-to models of modern-day retrieval systems (Manning et al. 2008, Chapters 6 and 7).

We argue that search ranking, retrieval modeling to be specific, is far from being solved.

Firstly, the success of Web search engines is largely due to the vast and diverse set of high-quality documents to retrieve from. Because of the diverse set of relevant documents out there, even if a search query is not well formulated, and even if the retrieval algorithm cannot return most of the relevant documents in the collection, a few good ones will match the query and be ranked at the top of the ranked list to satisfy the user. That these search engines can return satisfying results for most queries does not necessarily mean that the retrieval models they use are successful; the impression that the simple retrieval models are successful may be just an illusion created by the successful collection of a huge set of documents for the retrieval algorithm to search against.

Secondly, in cases where the document collection is small, where the set of documents relevant to the query is small, or where a high level of retrieval recall is needed, the current retrieval systems are still far from satisfactory. For example, in perhaps all forms of desktop search and some forms of enterprise search, the document collection is much less diverse and much smaller than the Web, and users can still be easily frustrated by the search systems. Even in Web search, for informational searches, users are still frequently frustrated by the current search engines (Feild et al. 2010). In legal discovery, the lawyers on both sides of the litigation care a lot about not missing any potentially relevant document, and usually spend lots of time carefully creating effective search queries.

Some may still ask, if search is not yet solved, why are the baseline retrieval models so difficult to surpass, and where can we see any large improvements? We show in this dissertation that two central and long-standing problems in retrieval, vocabulary mismatch (Furnas et al. 1987) and relevance-based term weighting (Croft and Harper 1979; Greiff 1998; Metzler 2008), might be the culprit. We show that the two problems are directly related, the vocabulary mismatch problem being the more general of the two. We show that the current retrieval models do not effectively model the vocabulary mismatch between query terms and relevant results. We show that term mismatch is a very common problem in search, and that a large potential gain is possible. We demonstrate several initial successes in addressing term mismatch in retrieval using novel prediction methods and theoretically motivated retrieval techniques. These techniques can automatically improve the retrieval system by making it mismatch-aware. Ordinary search users can also manually apply the query expansion technique studied in this work to further reduce mismatch and increase the effectiveness of their searches.


The age of reviews

November 8, 2011

It is an age of many kinds of reviews.  It is an age of good reviews, an age of bad reviews, an age of long reviews, an age of short reviews, an age of constructive reviews, as well as an age of destructive reviews.

This time, I got a totally different kind of review, one that is not only helpful but also empathetic.  One that not only tells you what is wrong or unclear, but also how to fix it.  One that not only tells you how to fix it, but also tries to understand why the current draft is laid out the way it is, and argues against the reasoning behind it, just to convince you that a better way is needed.

An empathetic review from WSDM: I feel this is worth blogging/bragging about, even though the paper itself was rejected.


Putting Conjunctive Normal Form (CNF) Expansion to the Test

June 9, 2011

I recently participated in the Patent Olympics event remotely from my home.  It was quite a learning experience.

About the competition: The goal for each team is to interact with the topic authority (referee), and to discover and submit as many relevant documents as possible.  Basically it’s an interactive search task, where the topic authority provides both the information need and the relevance judgments.  This year, there were 3 topics, thus 3 rounds of interaction.  Each round was limited to 26 minutes, i.e. from getting the request to submitting all the results.  A maximum of 200 results per topic was allowed for submission.  The referees can judge results at any time, even before or after the interactions, but typically we are all busy people, so most of the judging happens during the 26×2 minutes (2 being the number of participating teams).

Our approach: We aimed to formulate the perfect query through interaction with the topic authority and with the retrieval results.  A particular kind of query we are interested in is the Boolean Conjunctive Normal Form (CNF) query.  I’ve mentioned the advantages of such CNF queries in prior posts about term necessity/recall and WikiQuery.

Results: very different results came out of the 3 topics.  In summary, CNF, CNF, and when in doubt, use CNF queries.  If you are interested in more details, read on.

Well, it was the middle of the night for me; even though I had taken a shower and put on a shirt, my brain had no response to the first topic, which was about the analgesic effect of some chemical.  The good thing about formulating CNF queries is that even if you are brain-dead, as long as your searcher is responding effectively, you can get the right query out.  So I started the CNF routine by asking what other names this chemical has, and expanded the original name with them.  Fearing a bad impression from the referee, I didn’t ask her for synonyms of “analgesic”; “pain reduction” and “analgesia” were all that I expanded it with.  Now that my brain is working better, I can easily suggest “pain relief”, “relieve pain”, etc., and more if I had remembered to use a thesaurus.  Overall, I didn’t do particularly badly, but I surely frustrated the referee at some point, because I didn’t provide any genuine help except throwing a long list of results at her.  So, the CNF query saved my ass, so to speak.

After warming up on the first topic, the second one was quite easy.  The referee did a good job of presenting a very detailed query, which made my job of formulating the CNF query very easy.  We were already getting about 90% precision in the top 20 (and likely even further down the list) with the first query.  The referee was happy to see good results, and our score skyrocketed.  We had found a near-perfect CNF query.

The third topic was a lot trickier.  The referee asked for the manufacture of a certain potassium salt, giving only the chemical structure of the salt.  For a chemical-structure search system this might be a very good test topic, but since my system is purely text-based (a Lemur CGI running on distributed indexes), I naturally asked the referee for the common name of that chemical.  The referee said the other team didn’t get that information until much later, so that wasn’t much help.  After some struggling with the structure, about halfway into the topic, the referee finally found the name in a result patent and gave it to me.  It turned out to be a very popular sweetener, so naturally there are many patents about using this sweetener to do things.  These were false positives, as the referee was only interested in ways of manufacturing the sweetener.  Short on time, I panicked.  Instead of doing the CNF routine of looking for synonyms of the search terms in a thesaurus (there are general thesauri and thesauri for chemicals), I did the exact opposite: I used proximity operators to restrict the match, requiring the word “manufacture” and the name of the chemical to appear within a small window in the result documents.  If you read my thesis proposal, you’ll know that this is exactly the kind of mismatch that causes ad hoc retrieval to fail.  As a result, I think I found only 2 relevant documents for this query.  After the event, I consulted a general thesaurus and found that “synthesis” is the right word to use, so a query like (synthesize OR manufacture OR produce) AND (chemical OR other names of the chemical) would have given at least 10 good hits.  I failed by not doing it the CNF way.
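For readers unfamiliar with CNF queries, here is a tiny sketch of how such a query filters documents: every disjunction (OR-group) must be matched by at least one of its alternatives.  The term names below are placeholders of my own, not the actual competition query:

```python
# Each inner set is one concept with its synonyms (an OR-group);
# the query requires every group to be matched.  Placeholder terms,
# not the real topic: "chemicalx" stands in for the sweetener's name.
cnf = [
    {"synthesis", "synthesize", "manufacture", "produce"},
    {"chemicalx"},  # the chemical's name and its aliases would go here
]

def matches_cnf(doc_text, cnf_query):
    """True iff every OR-group has at least one term in the document."""
    tokens = set(doc_text.lower().split())
    return all(tokens & group for group in cnf_query)
```

A patent merely about using the sweetener fails the first group, so the false positives that flooded my result list would have been filtered out, without the over-restriction that the proximity hack caused.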


Scoring was based on the total number of relevant documents found, and on the users’ happiness with the system.  Even without the topic for which I got the near-perfect CNF query, I was surprisingly already leading the competition, scoring a bit more than last year’s champion (2nd place this year).  With the perfect query counted in, I got 20 more discovered relevant documents overall.

In terms of UI, I scored the lowest among the 3 teams.  I didn’t have much time to prepare the UI; there was only time to distribute the collection over 3 nodes/6 CPUs.  Since I wasn’t at the event, I didn’t get to see the other participants’ systems.  What a pity.

Scoring board here:

Some lessons learned:

No matter how efficient you think you are, 26 minutes is a short time to get a good set of results.  By consciously formulating CNF queries one can save some time, but it’s still quite stressful when the topic is difficult, like the sweetener manufacturing query.

The decision to compete by formulating the best query turned out to be a good one.  A lot of different things can be done for chemical patent search, e.g. retrieval strategies such as citation analysis, chemical structure search and chemical name matching, or document processing such as named entity (chemical name, disease name) annotation.  However, within a limited time of interaction, I guess the most effective way to interact with the searcher and the collection is still to vary and improve your query.  I don’t know exactly what the other teams did, but I’m sure CNF querying is my secret weapon.  And I’m glad that text search was enough for this year’s topics.

Because of the short time frame, the UI turned out to be a big factor.  Result titles and snippets speed up relevance judging a lot.  If I had time, I would improve the result presentation to include the titles of the patents.  There are also ways of automatically submitting results from the UI, to save the time of copy-pasting.  But I was copy-pasting 200 results at once the moment we arrived at the final query; I don’t think clicking a button in the UI to submit a single result would be more efficient than batch submission, except to please the referees with a fancier UI.

I made it sound simple to do CNF querying, but as you have probably noticed, if the synonym suggestion component is not integrated into the system, the user (myself) forgets to expand term by term, and thus cannot formulate effective CNF queries, especially when pressed for time.  The cognitive burden of understanding the initial query and analyzing the results, plus the brain power needed to keep the searcher busy, is already huge.

With CNF queries, it’s easy to restrict them further and still get a reasonable number of relevant hits.  For one query I used proximity restrictions, and for another I restricted the search to the claims field of the patents, per the searcher’s request.  I’m not sure whether they helped overall performance at lower ranks, but they perceivably improved top precision.

A final word: 3 topics is definitely not enough for any serious evaluation.  The evaluation metric – the total number of relevant documents – can easily be skewed by one topic that has lots of relevant hits.  A per-topic evaluation might be better.  Other factors may also affect performance; for example, the other team did not submit a full 200 results per topic, and that may have held them back when compared on the total number of relevant documents returned.  So, as always, be careful when interpreting any result.  If you are interested in testing out CNF queries yourself, try them here with your own information needs:


Some extras about the term necessity work

October 30, 2010

There are several important things I didn’t get to talk about at CIKM 2010.  The talk focused on technical understanding of the term necessity concept and its role in term weighting.  Below is future work that the IR field could do, and some related work I didn’t cover.

I’ll also tell a bit of the background story of how we arrived at the term necessity work from our previous structured retrieval work.  (Michael Bendersky asked me this question at CIKM, so at least there is some external interest.)

1. There is a lot more future work possible than any one group of people can finish in one project, so I’d like to see more people working on necessity prediction and applying necessity to retrieval in different ways.  After more than 30 years of little progress, we now understand that term necessity/recall is an important part of term weighting (the other part being idf).  Since necessity lies at the core of the retrieval model, many kinds of intervention in retrieval are possible: as term weighting for original query terms, expansion terms and/or bigrams; as a feature to guide expansion toward terms that tend to mismatch relevant documents; or in totally new approaches.

The important sanity check is to make sure there are in fact lots of mismatch cases between query terms and relevant documents in your dataset; otherwise, mismatch prediction won’t help much.  High-mismatch-rate scenarios include long queries, short documents, and complex query terms such as phrases or field restrictions.
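As a sketch of the term-weighting intervention in point 1, one simple way to fold a predicted necessity P(t|R) into retrieval is to scale each query term’s contribution to a BM25-style score by its predicted necessity.  This is my own illustration of the idea, not the exact scheme from the paper, and the numbers are made up:

```python
import math

def necessity_bm25(doc_tf, doc_len, query, necessity, df, n_docs,
                   k1=1.2, b=0.75, avg_len=100.0):
    """BM25-style score with each term's contribution scaled by its
    predicted necessity P(t|R).  Illustrative sketch only.
    doc_tf: term -> frequency in the document; df: term -> doc frequency."""
    score = 0.0
    for t in query:
        tf = doc_tf.get(t, 0)
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        score += necessity.get(t, 0.5) * idf * tf_part  # 0.5 = the old default
    return score
```

With equal idf, a document matching a high-necessity term now outranks one matching only a low-necessity term, which plain idf weighting cannot distinguish.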

2. Besides the ones listed in my slides, Regression Rank (Matt Lease) and term importance (Bendersky) are a few must-reads.  I’ll have a fairly complete review of related work in my thesis proposal this December.  If you want a rough draft, email me, or look at the WikiQuery website for the list of queries I used in doing the survey; you can follow the links on the pages to find results from Google or Yahoo.

3. How did I end up in this work from structured retrieval?

A short version of the story:

Working with sentence retrieval (short documents) and trying to find a good form of structured query led us to conjunctive normal form Boolean queries, which basically try to improve term recall.  Teaching the probabilistic models in the IR class led us to revisit the Binary Independence Model, which led to the understanding that P(t|R) is term necessity (the necessity of a term occurring in a document in order for the document to be relevant), which is also just term recall (the percentage of relevant documents matching the term).  Connecting these together, we get the experiments and the paper.
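Since term recall has such a simple definition, estimating it from relevance judgments takes only a few lines.  Here is a small sketch of my own for illustration, where each relevant document is represented as a set of its terms:

```python
def term_recall(term, relevant_docs):
    """P(t|R): the fraction of judged-relevant documents containing the term.
    relevant_docs is a list of token sets, one per relevant document."""
    if not relevant_docs:
        return 0.0
    hits = sum(1 for doc in relevant_docs if term in doc)
    return hits / len(relevant_docs)

def mismatch_rate(term, relevant_docs):
    """The mismatch rate is simply the complement of term recall."""
    return 1.0 - term_recall(term, relevant_docs)
```

Of course, in ad hoc retrieval the relevant set is unknown at query time, which is exactly why predicting this quantity is the hard part.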

A longer version of the same story:

The initial influence came from the TREC Legal track.  Yangbo and I participated in the 2nd year of the Legal track, and in both years, to the disappointment of all participants, no keyword ranking approach could beat the lawyers’ (Boolean, roughly conjunctive normal form – CNF) structured queries.  We were simply frustrated, without much understanding of the reason behind it.  I remember Doug Oard (one of the organizers of the track) even suggested taking the top keyword-ranked documents that don’t match the Boolean filter to replace the bottom-ranked documents that do match the Boolean filter.  Whether the idea worked can be found in a later year’s TREC report from Doug’s group, but that’s not so important for the current story.  The important thing is that we were all frustrated, and didn’t realize that the lawyers were actually using queries superior to all of our retrieval models and techniques.

The next piece of work was the CIKM 2009 poster, though most of the experiments for it happened much earlier, in summer 2008.  We automatically created conjunctive normal form (CNF) Boolean filters from the questions, to filter out top false-positive sentences for a question answering application.  For the question “When did Wilt Chamberlain score 100 points?”, the final Boolean filter is: (Wilt OR Chamberlain) AND (100 OR points) AND #any:date.  For the TREC Legal track we were not involved in creating the queries; the lawyers did that for us.  But now that we looked at the actual ranked lists returned by the keyword queries, we realized that this kind of Boolean query is very effective at filtering out false positives caused by the keyword retrieval model’s bias toward high-idf terms.  For example, #any:date will match any “date” named-entity tag, which has extremely low idf, so many of the top sentences don’t even contain a date.  But a date field is necessary to answer the question.  The term “100” also tends to be de-emphasized by the retrieval model, while “Wilt”, “Chamberlain” and “score” are preferred.  Because the newswire collection contains lots of sentences about Wilt scoring many points in one season or one year, these irrelevant sentences get returned at the top ranks.  After this work, CNF Boolean queries firmly held their position in my mind.

The next important link in the chain of reasoning actually came from teaching the probabilistic models in Jamie’s IR class in the spring semester of 2009 (using mostly Jamie’s slides).  When trying to explain the Binary Independence Model, the probability P(t|R) showed up.  Greiff’s exploratory data analysis had shown that this probability is difficult to estimate, so it drew my attention at the time, even if only subconsciously.  I named the probability term necessity, and when teaching, I tried to explain it to the students, and some of them understood it and accepted the name.

After summer 2008, I became more and more obsessed with CNF Boolean queries.  I discovered the interesting work of Marti Hearst (1996) and Mandar Mitra et al. (1998) on manually formulating CNF queries (Mandar even kindly gave me the Boolean filters from his PhD thesis, from about 10 years before!), and I discovered that librarians also use this kind of query.  At the beginning of summer 2009, without the TA burden, I set out to automatically build CNF Boolean structured queries for ad hoc retrieval.  The initial approach of naively grouping expansion terms into the original query concepts (concept groups from Mandar’s Boolean disjunctions) was not successful (no better than the baseline Boolean-filtered keyword retrieval).  Halfway into the summer, we decided to make it smarter by introducing term necessity predictions, so that the algorithm would know which terms to expand.  That attempt was also not very successful, because most of the expansion terms were not related to the query concept at all.  The poor term similarity measure led me to finally use local SVD to calculate term similarity.  Luckily, adapting the SVD code in Lemur did not take much time.  But still, time flew by quickly, and per-term expansion seemed far away, so I asked Jamie to allow me one more week, just to try weighting the original query terms using the predicted necessity values.  I still clearly remember, on the way to the LTI picnic at the end of the summer, saying to Ni Lao (my officemate and good friend) that I needed to show some progress within the week, or end the current attempt.

That simple experiment did show a lot of promise, with a real (i.e. statistically significant) performance gain over the baseline language model on TREC description (long) queries.

Sometimes, when I look back, I think even if it had not been successful, I would still have asked for more time to analyze the errors made by the model, and tried more things to make it work.  I am quite convinced that such CNF queries and term necessity will work, given that the true necessity weights can give a 30-80% gain in MAP.  One thing is quite certain, though: I was too ambitious at that time.  Now, I’m proposing with the term weighting results, and planning to work on automatic CNF queries as thesis work.

After some good results, the rest was straightforward, though not always easy.  We spent the whole of September to December 2009 writing the work up.  Making the relationship between necessity and term weighting clear was easy, but telling a convincing story was not.  Justifying the features was a bit tricky but worthwhile.  Including more experiments on more datasets was quite easy.

Another significant event during the experimentation and writing period was a brief chat with Yiming Yang at an LTI event.  I explained the CNF queries to her, and Yiming immediately said, oh, you are improving recall.  This turned up in the SIGIR submission, and more significantly, it made it easy to make the connection between necessity and mismatch.

Term mismatch was in the title of a proposal Jamie and I wrote.

Following that came the SIGIR submission and rejection; the story was convincing, but one reviewer picked on a relatively low baseline run.  OK, fixed.  And following that, the CIKM submission and acceptance.  One reviewer from CIKM was extremely thorough: 8 main points and 16 small points about the work.  I took one week of vacation from my summer internship to do more experiments addressing the feedback from all the reviewers, which finally made the paper a full 10 pages long.  Sometimes I wonder how such a simple idea could need 10 pages to explain.

After that, the feedback from the proposal also came back.  On average, the proposal reviews were much more serious (and helpful) than the conference paper reviews.  It took me over a week to carefully read through them, addressing each point in my thesis proposal.  One reviewer emphasized the RIA workshop.  I re-examined the report from the workshop, and also dug out an old 2007 email from Ni Lao summarizing a discussion we had back then.  The email clearly identified two main causes of failure for search engines: 1) the emphasis problem, and 2) the mismatch problem.  This became the nice-looking pie chart in my CIKM talk.  Ni is just amazing!  Unfortunately, the pie chart and the RIA workshop connection are not in the CIKM paper; they did not arrive in time for the camera-ready deadline.

With the understanding of term necessity/recall/mismatch, everything else fits in surprisingly nicely, from CNF Boolean queries to the mismatch rate, to necessity, to the emphasis problem.  Back in 2007, it was unclear what caused the emphasis problem; we briefly guessed that it was related to term weighting, but were not sure what the optimal weights should be, until revisiting the BIM while teaching the IR class.  And even while creating the pie chart for the CIKM 2010 talk, it struck me how term necessity connects the emphasis problem and the mismatch problem.

In summary, I am usually surprisingly slow at understanding and generalization, both of empirical results and of theoretical concepts.  As laid out above, every transition took me months.  But I guess that’s true for most humans; that’s why research progresses slowly.  Months of experimentation, writing, teaching and discussion yield only a couple of really important serendipitous moments, and usually you don’t realize the importance of those moments until much later.  The only way forward is to keep doing everything (experiments, reading) and keep thinking (writing, talking) at the same time.

I am also not very good at lateral thinking, at making connections with existing concepts and papers.  This is relatively easy to address: crystallizing the idea and looking at it from various angles will surface related concepts and approaches, and talking to lots of people about the idea, plus searching for related work on the Web, will complement the limited knowledge of a single person.

Finally, something practically useful: you have to try it yourself to see how Conjunctive Normal Form queries increase both recall and top precision for your searches.  Here is a useful website for easily creating such Boolean CNF queries, storing them, searching for results, and even sharing the queries publicly so that other people may benefit from the queries you create.  I used this WikiQuery website for the literature review of my thesis proposal, and I find myself constantly coming back to the site to look at existing queries (re-finding relevant documents, or finding new ones) and to create new queries as my understanding of a topic increases.