Putting Conjunctive Normal Form (CNF) Expansion to the Test

I recently participated in the Patent Olympics event remotely from my home.  It was quite a learning experience.

About the competition: The goal for each team is to interact with the topic authority (referee), and to discover & submit as many relevant documents as possible.  Basically it’s an interactive search task, where the topic authority provides both the information need and the relevance judgements.  This year, there were 3 topics, thus 3 rounds of interaction.  Each round was limited to 26 minutes, from getting the request to submitting all the results.  A maximum of 200 results per topic was allowed for submission.  The referees can judge results at any time, even before or after the interactions, but we are all busy people, so most of the judging happens during the 26×2 minutes (2 is the number of participating teams).

Our approach: We aimed to formulate the perfect query through interaction with the topic authority and with the retrieval results.  The particular kind of query we are interested in is the Boolean Conjunctive Normal Form (CNF) query: an AND of several OR-groups, where each group holds the synonyms of one concept.  I’ve mentioned the advantages of such CNF queries in prior posts about term necessity/recall and WikiQuery.

Results: the 3 topics turned out very differently.  In summary: CNF, CNF, and when in doubt, use CNF queries.  If you are interested in more details, read on.

Well, it was the middle of the night for me.  Even though I took a shower and dressed myself in a shirt, my brain had no response to the first topic, which was about the analgesic effect of some chemical.  The good thing about formulating CNF queries is that even if you are brain dead, as long as your searcher is responding effectively, you can get the right query out.  So I started the CNF routine by asking what other names this chemical has, and added them to the original name as alternatives.  I was afraid of making a bad impression on the referee, so I didn’t ask her for synonyms of “analgesic”.  “Pain reduction” and “analgesia” were all that I expanded it with.  Now that my brain is working better, I could easily suggest “pain relief”, “relieve pain” etc., and more if I had remembered to use a thesaurus, e.g. thesaurus.com.  Overall, I didn’t do particularly badly, but I surely frustrated the referee at some point, because I didn’t provide any genuine help beyond throwing a long list of results at her.  So, the CNF query saved my ass, so to speak.

After warming up on the first topic, the second one was quite easy.  The referee did a good job of presenting a very detailed request, which made my job of formulating the CNF query very easy.  We were already getting about 90% precision at top 20 (and likely even further down the list) with the first query.  The referee was happy to see good results, and our score skyrocketed.  Found a near perfect CNF query.

The third topic was a lot trickier.  The referee asked for the manufacture of a certain potassium salt, giving only the chemical structure of the salt.  For a chemical-structure search system, this might be a very good test topic, but since my system is purely text based (Lemur CGI running on distributed indexes), I naturally asked the referee for the common name of that chemical.  The referee said the other team didn’t get that information until much later, so that wasn’t very helpful.  After some struggling with the structure, and about halfway through the time for the topic, the referee finally found the name in a result patent and gave it to me.  It turned out to be a very popular sweetener.  Naturally there are many patents about using this sweetener to do things.  These were false positives, as the referee was only interested in ways of manufacturing the sweetener.  Short on time, I panicked.  Instead of doing the CNF routine of looking up synonyms of the search terms in a thesaurus (there are general thesauri and thesauri for chemicals), I did the exact opposite: I used proximity operators to restrict the match, requiring the word “manufacture” and the name of the chemical to appear within a small window in the result documents.  If you read my thesis proposal, you’ll know that this is exactly the kind of mismatch case that causes ad hoc retrieval to fail.  As a result, I think I only found 2 relevant documents for this query.  After the event, I consulted a general thesaurus and found that “synthesis” is the right word to use, so a query like (synthesize OR manufacture OR produce) AND (chemical OR other names of the chemical) would have given at least 10 good hits.  Failed by not doing it the CNF way.
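The contrast between the CNF routine and the proximity restriction can be sketched in a few lines of Python.  This is only a toy illustration, not the Lemur setup used at the event: the documents are made up, "sweetener_x" and "compound_y" are hypothetical stand-ins for the chemical's names, and matching is exact-token with no stemming (which is why "synthesis" is listed next to "synthesize").

```python
import re

# Each inner list is one concept (an OR-group of synonyms); the groups are ANDed.
# "sweetener_x" and "compound_y" are hypothetical placeholders for the chemical's names.
CNF_QUERY = [
    ["synthesize", "synthesis", "manufacture", "produce"],
    ["sweetener_x", "compound_y"],
]

# Made-up documents, for illustration only.
DOCS = {
    "doc1": "A process to manufacture sweetener_x in high yield.",
    "doc2": "Synthesis of compound_y by reacting an acid with potassium hydroxide.",
    "doc3": "A beverage containing sweetener_x and fruit juice.",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def to_boolean_string(cnf):
    """Render the CNF query as a readable Boolean string."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in cnf)


def matches_cnf(cnf, text):
    """A document matches if every OR-group has at least one term present."""
    tokens = set(tokenize(text))
    return all(any(term in tokens for term in group) for group in cnf)


def matches_within_window(term_a, term_b, text, window):
    """True if the two terms occur fewer than `window` token positions apart."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(i - j) < window for i in pos_a for j in pos_b)


cnf_hits = [d for d, text in DOCS.items() if matches_cnf(CNF_QUERY, text)]
window_hits = [
    d for d, text in DOCS.items()
    if matches_within_window("manufacture", "sweetener_x", text, 5)
]

print(to_boolean_string(CNF_QUERY))
print("CNF hits:", cnf_hits)
print("Proximity hits:", window_hits)
```

In this toy collection the expanded CNF query keeps the manufacturing document that uses different vocabulary (doc2) and still rejects the "use of the sweetener" false positive (doc3), while the unexpanded proximity query only finds doc1.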


Scoring was based on the total number of relevant documents found, and the users’ happiness with the system.  Even without the topic for which I got a near perfect CNF query, I was, surprisingly, already leading the competition.  Scored a bit more than last year’s champion, 2nd place this year.  Overall, with the near perfect query counted in, I found 20 more relevant documents.

In terms of UI, I scored the lowest among the 3 teams.  I didn’t have much time to prepare the UI, only enough to distribute the collection over 3 nodes/6 CPUs.  Since I wasn’t at the event, I didn’t get to see the other participants’ systems.  What a pity.

Scoring board here: http://patolympics.ir-facility.org/PatOlympics/scoreboard.html

Some learnings:

No matter how efficient you think you are, 26 mins is a short time to get a good set of results.  By consciously formulating CNF queries, one can save some time, but it’s still quite stressful when the topic is difficult, like the sweetener manufacture query.

The decision to compete by formulating the best query turned out to be a good one.  A lot of different things can be done for chemical patent search, e.g. for retrieval strategies: citation analysis, chemical structure search, chemical name matching; for document processing: named entity (chemical name, disease name) annotation.  However, within a limited time of interaction, I guess the most effective way of interacting with the searcher and the collection is still to vary and improve your query.  I don’t know exactly what the other teams did, but I’m sure CNF querying is my secret weapon.  And I’m glad that text search was enough for the topics this year.

Because of the short time frame, the UI turned out to be a big factor.  Result titles and snippets speed up the relevance judging process a lot.  I would improve the result presentation a bit to include the titles of the patents if I had the time.  There are also ways of automatically submitting results from the UI, to save the time of copy-pasting.  But I was copy-pasting 200 results at once, at the moment we arrived at the final query.  I don’t think clicking a button in the UI to submit a single result would be more efficient than batch submission, except to please the referees with a fancier UI.

I made CNF querying sound simple, but as you have probably noticed, if the synonym suggestion component is not integrated into the system, the user (myself) forgets to expand term by term, and thus cannot formulate effective CNF queries, especially when pressed for time.  The cognitive burden of understanding the initial query and analyzing the results, plus the brain power needed to interact with the searcher and keep him/her busy, is already heavy enough.

With CNF queries, it’s easy to further restrict them and still get a reasonable number of relevant hits.  For one query, I used proximity restrictions, and for another query I restricted the search to the claims field of the patents at the searcher’s request.  I’m not sure whether they helped overall performance at lower ranks, but they did noticeably improve top precision.
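As a rough sketch of how such restrictions could be layered on top of a CNF query, the helper below renders synonym groups into an Indri-style structured query string.  It assumes the operators #band, #syn, #1 and #uwN and the term.field suffix from the Indri query language that ships with Lemur; the helper names, the field name "claims", and the example terms are hypothetical, and the exact syntax accepted by a given Lemur CGI install should be checked against its documentation.  These are not the queries actually submitted at the event.

```python
def indri_term(term, field=None):
    """Render one term; multi-word terms become an exact phrase with #1(...).

    The `.field` suffix (assumed Indri syntax) restricts a word to that field.
    """
    words = term.split()
    if field:
        words = [f"{w}.{field}" for w in words]
    if len(words) == 1:
        return words[0]
    return "#1(" + " ".join(words) + ")"


def to_indri(cnf, field=None, window=None):
    """Render a CNF query (a list of synonym groups) as one query string.

    Without `window`, the groups are combined with #band (Boolean AND).
    With `window`, they are wrapped in an unordered window #uwN instead,
    which is the stricter proximity restriction.
    """
    groups = [
        "#syn(" + " ".join(indri_term(t, field) for t in group) + ")"
        for group in cnf
    ]
    operator = f"#uw{window}" if window else "#band"
    return operator + "(" + " ".join(groups) + ")"


# Hypothetical example terms; "sweetener_x" stands in for a chemical name.
query = [["synthesize", "manufacture"], ["sweetener_x"]]

print(to_indri(query))
print(to_indri(query, field="claims"))
print(to_indri(query, window=20))
print(indri_term("pain relief"))
```

Keeping the synonym groups as plain data and only choosing the restriction at render time makes it cheap to try the unrestricted query first and tighten it afterwards.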

A final word: 3 topics are definitely not enough for any serious evaluation.  The evaluation metric, the total number of relevant documents, can easily be dominated by 1 topic that has lots of relevant hits.  Maybe a per-topic evaluation would be better.  Other factors may also affect performance.  For example, the other team did not submit the full 200 results per topic, and that may have held them back when compared on the total number of relevant documents returned.  So as always, be careful when interpreting any result.  If you are interested in testing out CNF queries yourself, try them out here with your own information needs: boston.lti.cs.cmu.edu/wikiquery/.
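To make the point about the metric concrete, here is a small sketch comparing the total count with two per-topic views.  The numbers are invented purely for illustration and are not the actual competition scores; the team names and function names are likewise hypothetical.

```python
# Made-up relevant-document counts per topic, for illustration only.
relevant_found = {
    "team_a": {"topic1": 10, "topic2": 60, "topic3": 2},
    "team_b": {"topic1": 15, "topic2": 30, "topic3": 8},
}


def total_relevant(counts):
    """The competition-style metric: sum of relevant documents over topics."""
    return sum(counts.values())


def topics_won(team, all_counts):
    """Number of topics on which `team` found strictly more than every other team."""
    others = [c for name, c in all_counts.items() if name != team]
    return sum(
        1
        for topic, n in all_counts[team].items()
        if all(n > other[topic] for other in others)
    )


def mean_topic_share(team, all_counts):
    """Average, over topics, of the team's count divided by the best count on that topic."""
    shares = []
    for topic, n in all_counts[team].items():
        best = max(c[topic] for c in all_counts.values())
        shares.append(n / best if best else 0.0)
    return sum(shares) / len(shares)


for team, counts in relevant_found.items():
    print(
        team,
        "total:", total_relevant(counts),
        "topics won:", topics_won(team, relevant_found),
        "mean share:", round(mean_topic_share(team, relevant_found), 3),
    )
```

With these invented counts, team_a leads on the total only because of topic2, while team_b is ahead on two of the three topics and on the per-topic average, which is exactly the sensitivity a per-topic evaluation would expose.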


