
Koala++'s blog

Query Expansion Based on Query Logs

2009-12-12 18:46:30 | Category: Search Engines

         Selected and translated from "Probabilistic Query Expansion Using Query Logs"

One important assumption behind this method is that the clicked documents are relevant to the query. This assumption may appear too strong. However, although the clicking information is not as accurate as explicit relevance judgment in traditional IR, the user's choice does suggest a certain degree of relevance. In fact, users usually do not make the choice randomly. In addition, we benefit from a large quantity of query logs. Even if some of the document clicks are erroneous, we can expect that most users do click on documents that are, or seem to be, relevant. Our previous work on using query logs to cluster similar queries also strongly supports this assumption [16]. Therefore, query logs can be taken as a very valuable resource containing abundant relevance feedback data. Thus we can overcome the problem of lacking sufficient relevance judgments in traditional relevance feedback technique. In comparison with pseudo-relevance feedback, our method has an obvious advantage: not only are the clicked documents part of the top-ranked documents, but also there is a further selection by the user. So document clicks are more reliable indications than those used in pseudo relevance feedback.

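         This method rests on one important assumption: the clicked documents are relevant to the query. The assumption may look too strong. Click information is not as accurate as the explicit relevance judgments of traditional IR, but the user's choice does suggest a certain degree of relevance. In fact, users usually do not choose at random. Moreover, we draw on a large volume of logs. Even if some documents are clicked by mistake, we can still expect that most users click on documents that are, or appear to be, relevant. Our method has an obvious advantage: the clicked documents are not only among the top-ranked documents but have also passed a further selection by the user. Document clicks are therefore a more reliable indication than the signals used in pseudo-relevance feedback. (Translator's note: I think the authors could also argue it this way: if users really clicked at random, there would be no point in optimizing the ranking in the first place.)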

         The log-based query expansion method has three other important properties. First, since the term correlations can be pre-computed offline, the initial retrieval phase is not needed anymore. Second, since query logs contain query sessions from different users, the term correlations can reflect the preference of most users. For example, if the majority of users use “windows” to search for information about Microsoft Windows product, the term “windows” will have much stronger correlations with the terms such as “Microsoft”, “OS” and “software” rather than with the terms such as “decorate”, “door” and “house”. Thus the expanded query will result in a higher ranking for the documents about Microsoft Windows. The similar idea has been used in several existing search engines, such as Direct Hit [5]. Our query expansion approach can produce the same results. Third, the term correlations may evolve along with the accumulation of user logs. The query expansion process can reflect updated user’s interests at a specific time.

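         The log-based query expansion method has three important properties. First, because the term correlations can be pre-computed offline, the initial retrieval phase is no longer needed. Second, because query logs contain query sessions from different users, the term correlations reflect the preferences of most users. For example, if most users type "windows" to search for the Microsoft Windows product, the term "windows" will correlate much more strongly with terms such as "Microsoft", "OS" and "software" than with terms such as "decorate", "door" and "house". The expanded query therefore ranks documents about Microsoft Windows higher. A similar idea is already used in existing search engines such as Direct Hit. Our query expansion method can produce the same results. Third, the term correlations can evolve as user logs accumulate, so query expansion can reflect users' updated interests at a given time.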

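         Each document can be represented in the document space as a document vector $\{W_1^{(d)}, W_2^{(d)}, \ldots, W_n^{(d)}\}$, where $W_i^{(d)}$ is the weight of the i-th term in the document, computed with the traditional TF-IDF measure: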

            [Equation: TF-IDF weight of the i-th term; the formula image was not preserved]

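         where $tf_i^{(d)}$ is the frequency of the i-th term in document D, N is the total number of documents in the collection, and $n_i$ is the number of documents that contain the i-th term.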
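As a rough illustration of the weighting described above, here is a minimal Python sketch. The exact formula is not recoverable from this post, so the plain variant tf × ln(N / n_i) is an assumption, and the names `tfidf_vector`, `docs` and `doc_freq` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(doc_terms, doc_freq, num_docs):
    """Return {term: weight} for one document.

    Assumed variant: weight = tf * ln(N / n_i). The original formula
    may use a different normalization.
    """
    tf = Counter(doc_terms)
    return {
        term: count * math.log(num_docs / doc_freq[term])
        for term, count in tf.items()
        if doc_freq.get(term, 0) > 0
    }

# Toy collection of three documents, already tokenized.
docs = [
    ["windows", "os", "software"],
    ["windows", "door", "house"],
    ["os", "kernel"],
]
# n_i: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))
weights = tfidf_vector(docs[0], doc_freq, len(docs))
```

In this toy collection "software" occurs in one document only, so it gets a higher weight than "windows" or "os", which each occur in two.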

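         To compare the query space with the document space, we only need to compute the similarity between each document vector and its corresponding query vector. Cosine similarity is used here: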

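$$\mathrm{Similarity} = \frac{\sum_{i=1}^{n} W_i^{(q)} W_i^{(d)}}{\sqrt{\sum_{i=1}^{n} \left(W_i^{(q)}\right)^2}\,\sqrt{\sum_{i=1}^{n} \left(W_i^{(d)}\right)^2}}$$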
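The cosine similarity between a query vector and a document vector can be sketched as follows. Representing each vector as a sparse `{term: weight}` dictionary is an assumption made for illustration, as is the function name `cosine_similarity`.

```python
import math

def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_q * norm_d)

same_direction = cosine_similarity({"windows": 1.0}, {"windows": 2.0})
no_overlap = cosine_similarity({"windows": 1.0}, {"door": 1.0})
partial = cosine_similarity({"windows": 1.0, "os": 1.0}, {"windows": 1.0})
```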

We noticed that many terms in the document space are never or seldom used in the users' queries. Thus many terms in the document vector appear with very small or otherwise zero weights in its corresponding query vector. This artifact will dramatically decrease the similarity between the two vectors if they are used in the measurement. To obtain a fairer measure, we use only the most important words in the document vectors for the similarity calculation, taking as many of them as there are terms in the virtual document.

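         We noticed that many terms in the document space are never or rarely used in users' queries, so many terms in a document vector have very small or zero weights in the corresponding query vector. Including them in the measurement would lower the similarity between the two vectors considerably. To obtain a fairer measure, we compute the similarity using only the most important words in the document vector.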
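The effect of restricting the document vector to its most important words can be sketched as below. Taking the cut-off to be the number of terms in the query-side vector is one reading of the passage, and the helper names `top_terms` and `truncated_cosine` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_terms(vec, k):
    """Keep only the k highest-weighted terms of a vector."""
    ranked = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def truncated_cosine(query_vec, doc_vec):
    """Compare the query vector with the top-weighted part of the document vector.

    Assumption: the cut-off equals the number of terms in the query-side vector.
    """
    return cosine(query_vec, top_terms(doc_vec, len(query_vec)))

query_vec = {"windows": 1.0, "os": 1.0}
doc_vec = {"windows": 2.0, "os": 2.0, "kernel": 1.0, "driver": 1.0, "patch": 1.0}
full = cosine(query_vec, doc_vec)
trimmed = truncated_cosine(query_vec, doc_vec)
```

Here the three low-weight document terms never occur in the query, so they only inflate the document norm; dropping them raises the similarity.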


         Query sessions in the query logs provide a possible way to bridge the gap between the query space and the document space. Figure 2 shows how correlations between the query terms and document terms can be established through the query sessions. In general, we assume that the terms in a query are correlated to the terms in the documents that the user clicked on. If there is at least one path between one query term and one document term, a link is created between them. By analyzing a large number of such links, we can obtain a probabilistic measure for the correlations between the terms in these two spaces (Figure 3).

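         Query sessions in the query logs provide a bridge between the query space and the document space. Figure 2 shows how correlations between query terms and document terms are established through query sessions. In general, we assume that the terms in a query are correlated with the terms in the documents the user clicked on. If at least one path exists between a query term and a document term, a link is created between them. By analyzing a large number of such links, we obtain a measure of the correlation between terms in the two spaces.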

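              [Figure 2: correlations between query terms and document terms established through query sessions]

              [Figure 3: probabilistic correlations between terms in the query space and the document space]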
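The link-building step can be sketched as follows. The session layout (a list of query terms paired with a list of clicked document ids) and the names `build_links` and `doc_terms` are assumptions made for illustration; the post does not describe the log format.

```python
from collections import defaultdict

def build_links(sessions, doc_terms):
    """Count query-term -> document-term links created by click sessions.

    sessions:  list of (query_terms, clicked_doc_ids) pairs
    doc_terms: {doc_id: list of terms in that document}
    A link (q, d) is counted once per session and clicked document
    in which q is a query term and d occurs in the document.
    """
    links = defaultdict(int)
    for query_terms, clicked_docs in sessions:
        for doc_id in clicked_docs:
            for q in set(query_terms):
                for d in set(doc_terms[doc_id]):
                    links[(q, d)] += 1
    return dict(links)

sessions = [
    (["windows"], ["d1"]),
    (["windows", "update"], ["d1", "d2"]),
]
doc_terms = {"d1": ["microsoft", "os"], "d2": ["patch", "os"]}
links = build_links(sessions, doc_terms)
```

Raw link counts like these are only the starting point; the probabilistic measure discussed next normalizes them into conditional probabilities.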

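         We first discuss how to determine the degree of correlation between terms. We define it as a conditional probability between terms, $P(w_j^{(d)} \mid w_i^{(q)})$, where $w_j^{(d)}$ and $w_i^{(q)}$ are an arbitrary document term and an arbitrary query term, respectively. This probability can be computed as follows, where S is the set of clicked documents for queries that contain the query term $w_i^{(q)}$: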

$$P(w_j^{(d)} \mid w_i^{(q)}) = \sum_{\forall D_k \in S} P(w_j^{(d)} \mid w_i^{(q)}, D_k) \times P(D_k \mid w_i^{(q)})$$

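We can assume that $P(w_j^{(d)} \mid w_i^{(q)}, D_k) = P(w_j^{(d)} \mid D_k)$, because once the document $D_k$ is given, the document term $w_j^{(d)}$ no longer depends on the query term $w_i^{(q)}$. This gives: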

$$P(w_j^{(d)} \mid w_i^{(q)}) = \sum_{\forall D_k \in S} P(w_j^{(d)} \mid D_k) \times P(D_k \mid w_i^{(q)}) \qquad (4)$$

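         $P(D_k \mid w_i^{(q)})$ is the conditional probability that document $D_k$ is clicked when $w_i^{(q)}$ appears in the user's query. $P(w_j^{(d)} \mid D_k)$ is the conditional probability that the term $w_j^{(d)}$ occurs given that document $D_k$ is clicked. The two can be estimated, respectively, from the query logs and from the frequency of terms in the documents, as follows: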

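$$P(D_k \mid w_i^{(q)}) = \frac{f_{ik}^{(q)}(w_i^{(q)}, D_k)}{f^{(q)}(w_i^{(q)})} \qquad (5)$$

$$P(w_j^{(d)} \mid D_k) = \frac{W_{jk}^{(d)}}{\max_{\forall t \in D_k} W_{tk}^{(d)}} \qquad (6)$$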

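         where $f_{ik}^{(q)}(w_i^{(q)}, D_k)$ is the number of query sessions in which the query word $w_i^{(q)}$ and the document $D_k$ appear together.

$f^{(q)}(w_i^{(q)})$ is the number of query sessions that contain the term $w_i^{(q)}$.

$P(w_j^{(d)} \mid D_k)$ is the normalized weight of the term $w_j^{(d)}$ in document $D_k$; it is normalized by dividing by the highest term weight in $D_k$.

Combining formulas (4), (5) and (6), we obtain the following formula for $P(w_j^{(d)} \mid w_i^{(q)})$: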

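$$P(w_j^{(d)} \mid w_i^{(q)}) = \sum_{\forall D_k \in S} \left( P(w_j^{(d)} \mid D_k) \times \frac{f_{ik}^{(q)}(w_i^{(q)}, D_k)}{f^{(q)}(w_i^{(q)})} \right) \qquad (7)$$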
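Formula (7) can be sketched in Python as follows, using the session counts for $P(D_k \mid w_i^{(q)})$ and max-normalized term weights for $P(w_j^{(d)} \mid D_k)$. The input layout and the names `term_correlations`, `sessions` and `doc_weights` are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict

def term_correlations(sessions, doc_weights):
    """Estimate P(document term | query term) from click sessions.

    sessions:    list of (query_terms, clicked_doc_ids) pairs
    doc_weights: {doc_id: {term: weight}}, e.g. TF-IDF weights
    Returns {query_term: {doc_term: probability}}.
    """
    sessions_with_term = defaultdict(int)             # f(q)(w_i)
    co_click = defaultdict(lambda: defaultdict(int))  # f_ik(q)(w_i, D_k)
    for query_terms, clicked in sessions:
        for q in set(query_terms):
            sessions_with_term[q] += 1
            for doc_id in set(clicked):
                co_click[q][doc_id] += 1

    corr = defaultdict(dict)
    for q, per_doc in co_click.items():
        for doc_id, f_ik in per_doc.items():
            weights = doc_weights.get(doc_id, {})
            if not weights:
                continue  # no term weights known for this document
            p_doc = f_ik / sessions_with_term[q]   # P(D_k | w_i)
            max_w = max(weights.values())
            for term, w in weights.items():
                p_term = w / max_w                 # P(w_j | D_k)
                corr[q][term] = corr[q].get(term, 0.0) + p_term * p_doc
    return {q: dict(terms) for q, terms in corr.items()}

sessions = [
    (["windows"], ["d1"]),
    (["windows"], ["d1"]),
    (["windows"], ["d2"]),
]
doc_weights = {
    "d1": {"microsoft": 2.0, "os": 1.0},
    "d2": {"door": 3.0, "os": 3.0},
}
corr = term_correlations(sessions, doc_weights)
```

Because this table depends only on the logs and the document weights, it can be computed offline and refreshed as new sessions accumulate.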

Our query expansion method is based on the probabilistic term correlations described above. When a new query is submitted, first, the terms in the query (after removing stop words) are extracted. Then for every query term, all correlated document terms are selected based on the conditional probability obtained by the formula (7). By combining the probabilities of all query terms, we can calculate the following cohesion weight of a document term for the new query Q:

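         When a new query is submitted, the terms in the query are first extracted (after stop words are removed). Then, for each query term, all correlated document terms are selected based on the conditional probability given by formula (7). By combining the probabilities over all query terms, we can compute the following cohesion weight of a document term for the new query Q: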

         [Equation: cohesion weight of a document term for the query Q; the formula image was not preserved]

Thus, for every query, we get a list of weighted candidate expansion terms. The top-ranked terms can be selected as expansion terms.

         Thus, for every query, we obtain a list of weighted candidate expansion terms, and the top-ranked ones can be selected as expansion terms.
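The expansion step can be sketched as follows. The combining formula is not preserved in this post, so the sketch assumes the cohesion weight is the log of the product of (P + 1) over the query terms, which equals the sum of ln(1 + P). The stop-word list and the names `expansion_terms` and `correlations` are likewise hypothetical.

```python
import math

STOP_WORDS = {"the", "a", "of", "for"}  # hypothetical, for illustration only

def expansion_terms(query, correlations, top_k=3):
    """Rank candidate expansion terms for a query.

    correlations: {query_term: {doc_term: P(doc_term | query_term)}}
    Assumed cohesion weight: sum over query terms of ln(1 + P).
    """
    query_terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    scores = {}
    for q in query_terms:
        for term, p in correlations.get(q, {}).items():
            scores[term] = scores.get(term, 0.0) + math.log(1.0 + p)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

correlations = {
    "windows": {"microsoft": 0.6, "os": 0.5, "door": 0.1},
    "update": {"microsoft": 0.4, "patch": 0.7},
}
result = expansion_terms("the windows update", correlations)
```

Under this assumed weighting, a document term that correlates with several query terms (here "microsoft") outranks one that correlates strongly with only a single query term.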

 
