Koala++'s blog

Computational Advertising, RTB

 
 
 


 
 

Relevance Feedback and Query Expansion [3]

2009-12-12 18:16:48 | Category: Search Engines

9.2 Global methods for query reformulation

Various user supports in the search process can help the user see how their searches are or are not working. This includes information about words that were omitted from the query because they were on stop lists, what words were stemmed to, the number of hits on each term or phrase, and whether words were dynamically turned into phrases. The IR system might also suggest search terms by means of a thesaurus or a controlled vocabulary. A user can also be allowed to browse lists of the terms that are in the inverted index, and thus find good terms that appear in the collection.

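         Many forms of user support during search can help users see whether their searches are working. These include information about which words were dropped from the query because they are on a stop list, what words were stemmed to, the number of hits for each term or phrase, and whether words were dynamically turned into phrases. An IR system can also suggest search terms using a thesaurus or a controlled vocabulary. Users can also browse the list of terms in the inverted index and so find good terms that occur in the collection.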
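The kind of feedback described above can be sketched in a few lines. This is only an illustration under stated assumptions: the stop list, the crude `toy_stem` function, the `explain_query` helper, and the three toy documents are all hypothetical and do not come from the book or from any real IR system.

```python
from collections import defaultdict

# Hypothetical stop list, for illustration only.
STOP_WORDS = {"the", "of", "a", "in", "on", "and"}

def toy_stem(word):
    """Very crude stemmer (an assumption): strips 'ing' or a trailing 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Build an inverted index mapping stemmed term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[toy_stem(word)].add(doc_id)
    return index

def explain_query(query, index):
    """Report dropped stop words, stemmed forms, and hits per stemmed term."""
    dropped, stemmed, hits = [], {}, {}
    for word in query.lower().split():
        if word in STOP_WORDS:
            dropped.append(word)
            continue
        stem = toy_stem(word)
        stemmed[word] = stem
        hits[stem] = len(index.get(stem, ()))
    return {"dropped": dropped, "stemmed": stemmed, "hits": hits}

# Toy collection (made up for this sketch).
DOCS = {
    1: "the history of search engines",
    2: "searching the web",
    3: "engine design",
}
INDEX = build_index(DOCS)
report = explain_query("the search engines", INDEX)
print(report)
```

Browsing the vocabulary, the last kind of support mentioned above, would amount to something like `sorted(INDEX)` in this sketch.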


In relevance feedback, users give additional input on documents (by marking documents in the results set as relevant or not), and this input is used to reweight the terms in the query for documents. In query expansion on the other hand, users give additional input on query words or phrases, possibly suggesting additional query terms. Some search engines (especially on the web) suggest related queries in response to a query; the users then opt to use one of these alternative query suggestions. Figure 9.6 shows an example of query suggestion options being presented in the Yahoo! web search engine. The central question in this form of query expansion is how to generate alternative or expanded queries for the user. The most common form of query expansion is global analysis, using some form of thesaurus. For each term t in a query, the query can be automatically expanded with synonyms and related words of t from the thesaurus. Use of a thesaurus can be combined with ideas of term weighting: for instance, one might weight added terms less than original query terms. Methods for building a thesaurus for query expansion include:

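         In query expansion, the user supplies additional query words or phrases, possibly suggesting extra query terms. Some search engines (especially on the web) suggest queries related to the one entered, and the user picks one of the suggestions. The core problem in this form of query expansion is how to generate alternative or expanded queries for the user. The most common form of query expansion is global analysis using some kind of thesaurus: for each term t in the query, the query is automatically expanded with synonyms and related words of t from the thesaurus. A thesaurus can be combined with term weighting; for example, added terms can be given lower weights than the terms of the original query. Methods for building a thesaurus for query expansion include: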

 

• Use of a controlled vocabulary that is maintained by human editors. Here, there is a canonical term for each concept. The subject headings of traditional library subject indexes, such as the Library of Congress Subject Headings, or the Dewey Decimal system are examples of a controlled vocabulary. Use of a controlled vocabulary is quite common for well-resourced domains. A well-known example is the Unified Medical Language System (UMLS) used with MedLine for querying the biomedical research literature. For example, in Figure 9.7, neoplasms was added to a search for cancer. This MedLine query expansion also contrasts with the Yahoo! example. The Yahoo! interface is a case of interactive query expansion, whereas PubMed does automatic query expansion. Unless the user chooses to examine the submitted query, they may not even realize that query expansion has occurred.

•  Have human editors maintain a controlled vocabulary, in which each concept has one canonical term.

 

• A manual thesaurus. Here, human editors have built up sets of synonymous names for concepts, without designating a canonical term. The UMLS metathesaurus is one example of a thesaurus. Statistics Canada maintains a thesaurus of preferred terms, synonyms, broader terms, and narrower terms for matters on which the government collects statistics, such as goods and services. This thesaurus is also bilingual English and French.

•  A manual thesaurus. Here editors build sets of synonyms for each concept without designating a canonical term.

 

• An automatically derived thesaurus. Here, word co-occurrence statistics over a collection of documents in a domain are used to automatically induce a thesaurus; see Section 9.2.3.

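•  An automatically derived thesaurus. Here word co-occurrence statistics over a collection of documents in a domain are used to induce the thesaurus automatically.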

 

• Query reformulations based on query log mining. Here, we exploit the manual query reformulations of other users to make suggestions to a new user. This requires a huge query volume, and is thus particularly appropriate to web search.

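•  Query reformulation based on query log mining. Here the manual reformulations made by other users are used to make suggestions to a new user. This requires a very large query volume, so it is well suited to web search.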
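The last method in the list, query log mining, can be sketched as counting which query users typed immediately after another one. This is a minimal sketch, not how any real search engine does it: the session format (one list of consecutive queries per user session), the function names, and the toy log are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def mine_reformulations(sessions):
    """Count how often query q was directly followed by q2 in the same session."""
    follow = defaultdict(Counter)
    for queries in sessions:
        for q, q2 in zip(queries, queries[1:]):
            if q != q2:  # ignore a query that was simply resubmitted
                follow[q][q2] += 1
    return follow

def suggest(query, follow, k=2):
    """Return up to k of the most frequent reformulations seen for query."""
    return [q2 for q2, _ in follow.get(query, Counter()).most_common(k)]

# Toy query log (made up): each inner list is one user's session.
SESSIONS = [
    ["palm", "palm tree"],
    ["palm", "palm pilot", "palm pilot price"],
    ["palm", "palm pilot"],
    ["jaguar", "jaguar car"],
]
FOLLOW = mine_reformulations(SESSIONS)
print(suggest("palm", FOLLOW))
```

With only four sessions the counts are meaningless, which is the point made above: the approach only becomes useful with a very large query volume.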

 

Thesaurus-based query expansion has the advantage of not requiring any user input. Use of query expansion generally increases recall and is widely used in many science and engineering fields. As well as such global analysis techniques, it is also possible to do query expansion by local analysis, for instance, by analyzing the documents in the result set. User input is now usually required, but a distinction remains as to whether the user is giving feedback on documents or on query terms.

The simplest way to compute a co-occurrence thesaurus is based on term-term similarities.

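Thesaurus-based query expansion has the advantage of not requiring user input. Query expansion generally increases recall and is widely used in science and engineering fields.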
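Thesaurus-based expansion combined with term weighting, where added terms count for less than the terms the user typed, might look like the sketch below. The thesaurus entries are hypothetical (only cancer → neoplasms echoes the Figure 9.7 example; "tumor" and the "car" entry are made up), and the weight of 0.5 and the simple additive `score` function are assumptions chosen for illustration.

```python
# Hypothetical hand-built thesaurus; entries are illustrative only.
THESAURUS = {
    "cancer": ["neoplasms", "tumor"],
    "car": ["automobile"],
}

def expand_query(terms, thesaurus, added_weight=0.5):
    """Return {term: weight}: original terms get 1.0, added terms added_weight."""
    weights = {t: 1.0 for t in terms}
    for t in terms:
        for related in thesaurus.get(t, []):
            # setdefault never lowers the weight of a term the user typed.
            weights.setdefault(related, added_weight)
    return weights

def score(doc_text, weights):
    """Sum the weights of the query terms that occur in the document."""
    words = set(doc_text.lower().split())
    return sum(w for t, w in weights.items() if t in words)

expanded = expand_query(["cancer", "treatment"], THESAURUS)
print(expanded)
print(score("treatment of neoplasms in mice", expanded))
```

A document that matches only through an added synonym is still retrieved, but it scores lower than one that matches the original wording.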

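The simplest way to compute a co-occurrence thesaurus is from term-term similarities. (The formulas that follow are tedious to type and add little here, so I have left them out; go straight to 18.)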
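Since the formulas are left out here, the following is a sketch of one common formulation of term-term similarity: build a term-document matrix A, scale each term row to unit length, and take C = A Aᵀ, so that each entry of C is the cosine similarity between two terms' document vectors. The use of raw counts, the row normalization, the helper names, and the toy documents are all assumptions of this sketch.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, A) where A[i][j] is the count of term i in document j."""
    vocab = sorted({w for d in docs for w in d.split()})
    counts = [Counter(d.split()) for d in docs]
    A = [[c[t] for c in counts] for t in vocab]
    return vocab, A

def normalize_rows(A):
    """Scale each term row to unit length so dot products become cosines."""
    out = []
    for row in A:
        norm = math.sqrt(sum(x * x for x in row))  # > 0: every term occurs somewhere
        out.append([x / norm for x in row])
    return out

def cooccurrence_thesaurus(docs):
    """Compute C = A A^T on the row-normalized term-document matrix."""
    vocab, A = term_document_matrix(docs)
    A = normalize_rows(A)
    C = [[sum(a * b for a, b in zip(u, v)) for v in A] for u in A]
    return vocab, C

def nearest_terms(term, vocab, C, k=2):
    """Return the k terms most similar to term (ties broken alphabetically)."""
    i = vocab.index(term)
    ranked = sorted(
        (j for j in range(len(vocab)) if j != i),
        key=lambda j: (-C[i][j], vocab[j]),
    )
    return [vocab[j] for j in ranked[:k]]

# Toy collection (made up for this sketch).
DOCS = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "fruit apple pie",
]
vocab, C = cooccurrence_thesaurus(DOCS)
print(nearest_terms("engine", vocab, C, k=1))
```

Terms that always appear in the same documents, such as "engine" and "repair" here, end up with the highest similarity, which shows both the appeal and the weakness of the method: co-occurrence is not the same thing as synonymy.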

Query expansion is often effective in increasing recall. However, there is a high cost to manually producing a thesaurus and then updating it for scientific and terminological developments within a field. In general a domain specific thesaurus is required: general thesauri and dictionaries give far too little coverage of the rich domain-particular vocabularies of most scientific fields. However, query expansion may also significantly decrease precision, particularly when the query contains ambiguous terms. For example, if the user searches for interest rate, expanding the query to interest rate fascinate evaluate is unlikely to be useful. Overall, query expansion is less successful than relevance feedback, though it may be as good as pseudo relevance feedback. It does, however, have the advantage of being much more understandable to the system user.

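         Query expansion is often effective at increasing recall. But producing a thesaurus by hand is expensive, and in general a domain-specific thesaurus is needed: general-purpose thesauri and dictionaries cover only a small part of the rich vocabulary of most scientific fields. Query expansion can also significantly reduce precision, especially when the query contains ambiguous terms. Overall, query expansion is less successful than relevance feedback, though it is about as good as pseudo relevance feedback, and it has the advantage of being easier for the system's users to understand.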
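The recall and precision trade-off can be made concrete with a toy experiment. Everything in this sketch is an assumption: the five documents, the relevance judgments, the Boolean OR retrieval model, and the domain-specific expansion term "loan". Only the "interest rate fascinate evaluate" expansion comes from the text above.

```python
def retrieve(query_terms, docs):
    """Return ids of documents containing at least one query term (Boolean OR)."""
    wanted = set(query_terms)
    return {doc_id for doc_id, text in docs.items() if wanted & set(text.split())}

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against relevance judgments."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant)
    return precision, recall

# Toy collection and judgments (made up): documents 1-3 are about borrowing costs.
DOCS = {
    1: "bank interest rate rises",
    2: "mortgage rate cut",
    3: "loan borrowing cost",
    4: "films that fascinate critics",
    5: "evaluate the new phone",
}
RELEVANT = {1, 2, 3}

original = ["interest", "rate"]
general_expansion = original + ["fascinate", "evaluate"]  # wrong word senses
domain_expansion = original + ["loan"]                    # assumed domain thesaurus

for name, query in [("original", original),
                    ("general", general_expansion),
                    ("domain", domain_expansion)]:
    print(name, precision_recall(retrieve(query, DOCS), RELEVANT))
```

In this toy setup the general-purpose synonyms halve precision without finding anything new, while the domain-specific term recovers the relevant document that the original query missed.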
