
Koala++'s blog

Computational Advertising, RTB

 
 
 


 
 

Relevance Feedback and Query Expansion [2]

2009-12-12 18:15:29 | Category: Search Engines


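Rocchio algorithm

$$\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$$

         Here q_0 is the vector of the original query and α is its weight. D_r and D_nr are the sets of relevant and nonrelevant documents, and β and γ are their weights. A reasonable parameter setting is α = 1, β = 0.75, γ = 0.25. Some retrieval systems accept only positive feedback, that is, γ = 0.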

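[Figure: an application of the Rocchio algorithm]

This is an application of the Rocchio algorithm. The initial query lies between the documents marked as relevant and those marked as nonrelevant, while the revised query after feedback lies closer to the relevant documents.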
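The Rocchio update described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation of any particular system. It assumes documents and queries are sparse term-weight dictionaries. The names `centroid` and `rocchio` are made up for this example, and dropping terms whose weight is not positive is an assumption (a common convention), not something stated in the post.

```python
from collections import defaultdict


def centroid(docs):
    """Mean of sparse term-weight vectors (dicts); empty input gives {}."""
    if not docs:
        return {}
    total = defaultdict(float)
    for doc in docs:
        for term, weight in doc.items():
            total[term] += weight
    return {term: weight / len(docs) for term, weight in total.items()}


def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Return the modified query alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr)."""
    rel_c = centroid(relevant)
    non_c = centroid(nonrelevant)
    modified = {}
    for term in set(q0) | set(rel_c) | set(non_c):
        weight = (alpha * q0.get(term, 0.0)
                  + beta * rel_c.get(term, 0.0)
                  - gamma * non_c.get(term, 0.0))
        # Assumption: terms whose weight is not positive are dropped.
        if weight > 0:
            modified[term] = weight
    return modified


# Hypothetical toy data: the user wants the animal, not the car.
q0 = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "feline": 1.0}]
nonrelevant = [{"jaguar": 1.0, "car": 1.0}]

qm = rocchio(q0, relevant, nonrelevant)
positive_only = rocchio(q0, relevant, nonrelevant, gamma=0.0)
```

With this toy data the revised query gains weight on "cat" and "feline", and "car" is dropped because its weight would be negative. Setting `gamma=0.0` gives the positive-feedback-only variant mentioned above.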

The success of relevance feedback depends on certain assumptions. Firstly, the user has to have sufficient knowledge to be able to make an initial query which is at least somewhere close to the documents they desire. This is needed anyhow for successful information retrieval in the basic case, but it is important to see the kinds of problems that relevance feedback cannot solve alone. Cases where relevance feedback alone is not sufficient include:

• Misspellings. If the user spells a term in a different way to the way it is spelled in any document in the collection, then relevance feedback is unlikely to be effective. This can be addressed by the spelling correction techniques of Chapter 3.

• Cross-language information retrieval. Documents in another language are not nearby in a vector space based on term distribution. Rather, documents in the same language cluster more closely together.

• Mismatch of searcher's vocabulary versus collection vocabulary. If the user searches for laptop but all the documents use the term notebook computer, then the query will fail, and relevance feedback is again most likely ineffective.

 

Secondly, the relevance feedback approach requires relevant documents to be similar to each other. That is, they should cluster. Ideally, the term distribution in all relevant documents will be similar to that in the documents marked by the users, while the term distribution in all nonrelevant documents will be different from that in relevant documents. Things will work well if all relevant documents are tightly clustered around a single prototype or, if there are several prototypes, if the relevant documents at least have significant vocabulary overlap, while similarities between relevant and nonrelevant documents are small. Implicitly, the Rocchio relevance feedback model treats relevant documents as a single cluster, which it models via the centroid of the cluster. This approach does not work as well if the relevant documents are a multimodal class, that is, if they consist of several clusters of documents within the vector space. This can happen with:

• Subsets of the documents using different vocabulary, such as Burma vs. Myanmar (Burma is the former name of Myanmar).

• A query for which the answer set is inherently disjunctive, such as "Pop stars who once worked at Burger King".

• Instances of a general concept, which often appear as a disjunction of more specific concepts, for example, felines.
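The single-centroid limitation can be illustrated with a small Python sketch. It assumes sparse term-weight dictionaries and cosine similarity. The documents and the helper names `cosine` and `centroid` are invented for this example. It compares how close one relevant document is to the centroid of all relevant documents and to the centroid of its own cluster.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(docs):
    """Mean of sparse term-weight vectors (dicts); empty input gives {}."""
    if not docs:
        return {}
    total = defaultdict(float)
    for doc in docs:
        for term, weight in doc.items():
            total[term] += weight
    return {term: weight / len(docs) for term, weight in total.items()}


# Hypothetical relevant documents that form two clusters with disjoint vocabulary.
burma_docs = [{"burma": 1.0, "rangoon": 1.0}, {"burma": 1.0, "junta": 1.0}]
myanmar_docs = [{"myanmar": 1.0, "yangon": 1.0}, {"myanmar": 1.0, "election": 1.0}]

single_centroid = centroid(burma_docs + myanmar_docs)  # what Rocchio models
cluster_centroid = centroid(burma_docs)                # one prototype per cluster

doc = burma_docs[0]
sim_single = cosine(single_centroid, doc)
sim_cluster = cosine(cluster_centroid, doc)
```

In this toy setup `sim_single` is lower than `sim_cluster`: the single centroid spreads its weight over both vocabularies, so it represents each cluster less well than that cluster's own prototype.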

 

Some web search engines offer a similar/related pages feature: the user indicates a document in the results set as exemplary from the standpoint of meeting his information need and requests more documents like it. This can be viewed as a particularly simple form of relevance feedback. However, in general relevance feedback has been little used in web search. One exception was the Excite web search engine, which initially provided full relevance feedback. However, the feature was in time dropped, due to lack of use. On the web, few people use advanced search interfaces and most would like to complete their search in a single interaction. But the lack of uptake also probably reflects two other factors: relevance feedback is hard to explain to the average user, and relevance feedback is mainly a recall-enhancing strategy, while web search users are only rarely concerned with getting sufficient recall.

Pseudo relevance feedback, also known as blind relevance feedback, provides a method for automatic local analysis. It automates the manual part of relevance feedback, so that the user gets improved retrieval performance without an extended interaction. The method is to do normal retrieval to find an initial set of most relevant documents, then to assume that the top k ranked documents are relevant, and finally to do relevance feedback as before under this assumption.
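The three steps above (retrieve, assume the top k are relevant, apply feedback) can be sketched as follows. This is a rough illustration under stated assumptions: documents are sparse term-weight dictionaries, ranking uses cosine similarity, and feedback is positive-only (γ = 0). The function names, the toy collection, and the defaults `k=2`, `alpha=1.0`, `beta=0.75` are choices made for this example.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Document indices sorted by decreasing similarity to the query."""
    return sorted(range(len(docs)),
                  key=lambda i: cosine(query, docs[i]),
                  reverse=True)


def pseudo_relevance_feedback(query, docs, k=2, alpha=1.0, beta=0.75):
    """Expand the query with the centroid of the top-k documents and re-rank."""
    # Steps 1 and 2: normal retrieval, then treat the top k as relevant.
    top_k = [docs[i] for i in rank(query, docs)[:k]]
    # Step 3: Rocchio-style update with positive feedback only.
    expanded = defaultdict(float)
    for term, weight in query.items():
        expanded[term] += alpha * weight
    for doc in top_k:
        for term, weight in doc.items():
            expanded[term] += beta * weight / len(top_k)
    expanded = dict(expanded)
    return expanded, rank(expanded, docs)


# Hypothetical toy collection.
docs = [
    {"laptop": 1.0, "battery": 1.0},
    {"laptop": 1.0, "notebook": 1.0, "computer": 1.0},
    {"notebook": 1.0, "computer": 1.0, "review": 1.0},
    {"banana": 1.0, "recipe": 1.0},
]
query = {"laptop": 1.0}
expanded_query, new_ranking = pseudo_relevance_feedback(query, docs)
```

In this toy collection the third document shares no term with the original query, but the expanded query picks up "notebook" and "computer" from the top-ranked documents and so gives it a nonzero score. The same mechanism can hurt when the top k documents are not actually relevant, since their terms are added all the same.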

 

On the web, DirectHit introduced the idea of ranking more highly documents that users chose to look at more often. In other words, clicks on links were assumed to indicate that the page was likely relevant to the query. This approach makes various assumptions, such as that the document summaries displayed in results lists (on whose basis users choose which documents to click on) are indicative of the relevance of these documents. In the original DirectHit search engine, the data about the click rates on pages was gathered globally, rather than being user or query specific. This is one form of the general area of clickstream mining. Today, a closely related approach is used in ranking the advertisements that match a web search query (Chapter 19).
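The idea of boosting documents by globally collected clicks can be illustrated with a short sketch. The post does not describe how DirectHit actually combined clicks with other ranking signals, so everything here is hypothetical: the log format `(user, query, doc_id)`, the function names, and the linear combination with `weight=0.5` are assumptions made only for illustration.

```python
from collections import Counter


def record_clicks(click_log):
    """Count clicks per document globally, ignoring which user or query produced them."""
    return Counter(doc_id for _user, _query, doc_id in click_log)


def rerank(results, clicks, weight=0.5):
    """Re-order (doc_id, base_score) pairs by base score plus a click-share bonus.

    The linear combination is an assumption, not DirectHit's actual formula.
    """
    total = sum(clicks.values())

    def score(item):
        doc_id, base_score = item
        click_share = clicks[doc_id] / total if total else 0.0
        return base_score + weight * click_share

    return [doc_id for doc_id, _ in sorted(results, key=score, reverse=True)]


# Hypothetical click log and base ranking.
click_log = [
    ("u1", "q1", "b"),
    ("u2", "q2", "b"),
    ("u3", "q1", "a"),
    ("u1", "q3", "b"),
]
results = [("a", 0.50), ("b", 0.45), ("c", 0.40)]

clicks = record_clicks(click_log)
reranked = rerank(results, clicks)
```

Here document "b" moves ahead of "a" because it collected three of the four clicks, even though those clicks came from different users and queries. That is the global, not query-specific, aggregation the paragraph describes.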
