Koala++'s blog

9.2 Global methods for query reformulation

Various forms of user support in the search process can help users see how their searches are or are not working. This includes information about words that were omitted from the query because they were on stop lists, what words were stemmed to, the number of hits on each term or phrase, and whether words were dynamically turned into phrases. The IR system might also suggest search terms by means of a thesaurus or a controlled vocabulary. A user can also be allowed to browse lists of the terms that are in the inverted index, and thus find good terms that appear in the collection.

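         Many kinds of user support during a search help users see whether their searches are working. These include information about words dropped from the query because they are on a stop list, the stemmed forms of words, the number of hits for each term or phrase, and whether words were dynamically turned into phrases. The IR system can also suggest search terms using a thesaurus or a controlled vocabulary. A user can also browse the list of terms in the inverted index to find good terms that occur in the collection.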

In relevance feedback, users give additional input on documents (by marking documents in the results set as relevant or not), and this input is used to reweight the terms in the query for documents. In query expansion, on the other hand, users give additional input on query words or phrases, possibly suggesting additional query terms. Some search engines (especially on the web) suggest related queries in response to a query; the users then opt to use one of these alternative query suggestions. Figure 9.6 shows an example of query suggestion options being presented in the Yahoo! web search engine. The central question in this form of query expansion is how to generate alternative or expanded queries for the user. The most common form of query expansion is global analysis, using some form of thesaurus. For each term t in a query, the query can be automatically expanded with synonyms and related words of t from the thesaurus. Use of a thesaurus can be combined with ideas of term weighting: for instance, one might weight added terms less than original query terms. Methods for building a thesaurus for query expansion include:

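         In query expansion, users supply additional query words or phrases, possibly as additional query terms. Some search engines (especially on the web) suggest queries related to the one entered, and the user picks one of the suggestions. The core question for this form of query expansion is how to generate alternative or expanded queries for the user. The most common form of query expansion is global analysis using some form of thesaurus: for each term t in the query, the query is automatically expanded with synonyms or related words from the thesaurus. Use of a thesaurus can be combined with term weighting; for example, added terms can be given smaller weights than the terms of the original query. Methods for building a thesaurus for query expansion include: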


• Use of a controlled vocabulary that is maintained by human editors. Here, there is a canonical term for each concept. The subject headings of traditional library subject indexes, such as the Library of Congress Subject Headings, or the Dewey Decimal system are examples of a controlled vocabulary. Use of a controlled vocabulary is quite common for well-resourced domains. A well-known example is the Unified Medical Language System (UMLS) used with MedLine for querying the biomedical research literature. For example, in Figure 9.7, neoplasms was added to a search for cancer. This MedLine query expansion also contrasts with the Yahoo! example. The Yahoo! interface is a case of interactive query expansion, whereas PubMed does automatic query expansion. Unless the user chooses to examine the submitted query, they may not even realize that query expansion has occurred.

➢  A controlled vocabulary maintained by human editors, in which each concept has one canonical term.


• A manual thesaurus. Here, human editors have built up sets of synonymous names for concepts, without designating a canonical term. The UMLS metathesaurus is one example of a thesaurus. Statistics Canada maintains a thesaurus of preferred terms, synonyms, broader terms, and narrower terms for matters on which the government collects statistics, such as goods and services. This thesaurus is also bilingual, in English and French.

➢  A manual thesaurus. Here, editors build up sets of synonyms for each concept instead of designating a canonical term.


• An automatically derived thesaurus. Here, word co-occurrence statistics over a collection of documents in a domain are used to automatically induce a thesaurus; see Section 9.2.3.

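➢  An automatically derived thesaurus. Here, word co-occurrence statistics over a collection of documents in a domain are used to generate the thesaurus automatically.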


• Query reformulations based on query log mining. Here, we exploit the manual query reformulations of other users to make suggestions to a new user. This requires a huge query volume, and is thus particularly appropriate to web search.

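➢  Query reformulation based on query log mining. Here, we use other users' manual query reformulations to make suggestions to a new user. This needs a large volume of queries, so it is well suited to web search.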
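The last item in the list, query log mining, can be illustrated with a small sketch. It assumes the log has already been split into per-user sessions (lists of consecutive queries) and simply counts which query most often directly follows another. The function names `mine_reformulations` and `suggest`, and the toy sessions, are made up for illustration; a real system would need far more data and some filtering.

```python
from collections import Counter, defaultdict

def mine_reformulations(sessions):
    """Count how often query b directly follows query a within a session."""
    follow = defaultdict(Counter)
    for queries in sessions:
        for a, b in zip(queries, queries[1:]):
            if a != b:  # ignore a query that was simply resubmitted
                follow[a][b] += 1
    return follow

def suggest(follow, query, k=3):
    """Return up to k of the most frequent reformulations of `query`."""
    counts = follow.get(query, Counter())
    return [q for q, _ in counts.most_common(k)]

# Hypothetical query sessions, one list per user session.
sessions = [
    ["palm", "palm tree"],
    ["palm", "palm pilot", "palm pilot price"],
    ["palm", "palm tree"],
    ["jaguar", "jaguar car"],
]

follow = mine_reformulations(sessions)
print(suggest(follow, "palm"))  # ['palm tree', 'palm pilot']
```

With only four sessions the counts are tiny, which is exactly why the approach needs a huge query volume to give reliable suggestions.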


Thesaurus-based query expansion has the advantage of not requiring any user input. Use of query expansion generally increases recall and is widely used in many science and engineering fields. As well as such global analysis techniques, it is also possible to do query expansion by local analysis, for instance, by analyzing the documents in the result set. User input is now usually required, but a distinction remains as to whether the user is giving feedback on documents or on query terms.
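As a rough sketch of thesaurus-based expansion combined with term weighting, the snippet below adds related terms to a query and gives them a lower weight than the original terms. The thesaurus contents, the function name `expand_query`, and the weight 0.5 are assumptions chosen for illustration; only the cancer/neoplasms pairing comes from the text above.

```python
def expand_query(terms, thesaurus, added_weight=0.5):
    """Return a term -> weight mapping for the expanded query.

    Original terms get weight 1.0; terms added from the thesaurus get
    `added_weight`, so they count less than what the user typed.
    """
    weights = {t: 1.0 for t in terms}
    for t in terms:
        for related in thesaurus.get(t, []):
            # setdefault keeps the weight of a term that is already present
            weights.setdefault(related, added_weight)
    return weights

# Hypothetical thesaurus entry; "tumor" is an invented extra synonym.
THESAURUS = {"cancer": ["neoplasms", "tumor"]}

expanded = expand_query(["cancer", "treatment"], THESAURUS)
print(expanded)
# {'cancer': 1.0, 'treatment': 1.0, 'neoplasms': 0.5, 'tumor': 0.5}
```

No user input is needed at query time, which matches the advantage mentioned above; the cost is in building and maintaining the thesaurus.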

The simplest way to compute a co-occurrence thesaurus is based on term-term similarities.
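One way to get such term-term similarities, sketched here under my own assumptions, is to build a term-document matrix A, length-normalize its rows, and compute C = AA^T, so that entry C[i][j] is the cosine similarity between the document vectors of terms i and j. The toy documents and the function names are invented for illustration, and raw counts are used as weights; other weighting schemes are possible.

```python
import math

def build_cooccurrence_thesaurus(docs):
    """Return (vocab, C) where C[i][j] is the similarity of terms i and j."""
    vocab = sorted({term for doc in docs for term in doc})
    # A is the term-document matrix: A[i][j] = count of term i in document j.
    A = [[doc.count(term) for doc in docs] for term in vocab]
    # Length-normalize each row so entries of C are cosine similarities.
    normalized = []
    for row in A:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row])
    # C = A * A^T, computed as dot products between rows.
    C = [[sum(a * b for a, b in zip(ri, rj)) for rj in normalized]
         for ri in normalized]
    return vocab, C

def related_terms(term, vocab, C, k=2):
    """Return the k terms most similar to `term`, excluding the term itself."""
    i = vocab.index(term)
    others = [j for j in range(len(vocab)) if j != i]
    ranked = sorted(others, key=lambda j: C[i][j], reverse=True)
    return [vocab[j] for j in ranked[:k]]

# Hypothetical tokenized documents.
docs = [
    ["car", "engine", "wheel"],
    ["car", "engine"],
    ["apple", "fruit"],
    ["apple", "fruit", "tree"],
]

vocab, C = build_cooccurrence_thesaurus(docs)
print(related_terms("car", vocab, C))  # ['engine', 'wheel']
```

Terms that occur in the same documents end up with high similarity even if they are not true synonyms, so the quality of the resulting associations depends heavily on the collection.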



Query expansion is often effective in increasing recall. However, there is a high cost to manually producing a thesaurus and then updating it for scientific and terminological developments within a field. In general a domain-specific thesaurus is required: general thesauri and dictionaries give far too little coverage of the rich domain-particular vocabularies of most scientific fields. However, query expansion may also significantly decrease precision, particularly when the query contains ambiguous terms. For example, if the user searches for "interest rate", expanding the query to "interest rate fascinate evaluate" is unlikely to be useful. Overall, query expansion is less successful than relevance feedback, though it may be as good as pseudo relevance feedback. It does, however, have the advantage of being much more understandable to the system user.
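The precision problem with ambiguous terms can be seen in a toy experiment. The sketch below assumes a tiny invented collection, made-up relevance judgments, and plain Boolean OR retrieval; it only illustrates the effect described above and says nothing about real systems.

```python
def retrieve(query_terms, docs):
    """Boolean OR retrieval: ids of documents containing any query term."""
    wanted = set(query_terms)
    return {doc_id for doc_id, text in docs.items() if wanted & set(text.split())}

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against relevant ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection; documents 1-3 are judged relevant to the
# financial sense of "interest rate".
docs = {
    1: "central bank raises the interest rate",
    2: "mortgage rate falls again",
    3: "lenders cut borrowing costs",
    4: "stories that fascinate children",
    5: "how to evaluate a used car",
}
relevant = {1, 2, 3}

original = ["interest", "rate"]
expanded = original + ["fascinate", "evaluate"]  # wrong-sense synonyms

p_orig, r_orig = precision_recall(retrieve(original, docs), relevant)
p_exp, r_exp = precision_recall(retrieve(expanded, docs), relevant)
print(p_orig, p_exp)  # 1.0 0.5
```

Here the wrong-sense synonyms pull in documents 4 and 5, so precision halves while recall stays at two out of three: the expansion adds noise without finding document 3.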


