
Koala++'s blog


Lucene Source Code Analysis [-3]

2009-07-08 08:24:13 | Category: Lucene


public float idf(Term term, Searcher searcher) throws IOException {
    return idf(searcher.docFreq(term), searcher.maxDoc());
}

Last time we finished with searcher.docFreq; now let's look at the maxDoc function:

public int maxDoc() throws IOException {
    return reader.maxDoc();
}

Stepping into it:

public int maxDoc() {
    return fieldsReader.size();
}

We have covered fieldsReader before; here it is again:

fieldsReader = new FieldsReader(cfsDir, segment, fieldInfos);

FieldsReader(Directory d, String segment, FieldInfos fn) throws IOException {
    fieldInfos = fn;

    fieldsStream = d.openInput(segment + ".fdt");
    indexStream = d.openInput(segment + ".fdx");

    size = (int) (indexStream.length() / 8);
}
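As an aside, the size arithmetic can be shown standalone. The class and method below are hypothetical helpers, not Lucene code; they only mirror the computation in the constructor above.

```java
// Hypothetical helper mirroring FieldsReader's size computation:
// the .fdx index file holds one 8-byte long pointer per document,
// so its length divided by 8 is the number of documents.
public class FdxSize {
    static final int BYTES_PER_POINTER = 8; // one long per document

    static int maxDoc(long fdxLengthInBytes) {
        return (int) (fdxLengthInBytes / BYTES_PER_POINTER);
    }

    public static void main(String[] args) {
        // An 80-byte .fdx file indexes 10 documents.
        System.out.println(FdxSize.maxDoc(80L)); // prints 10
    }
}
```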

fieldsReader.size() returns the number of documents. As discussed before, the .fdx file stores one long (8 bytes) per document, so dividing its length by 8 gives the document count. Now we can compute the idf:

public float idf(int docFreq, int numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
}

Note the comment: log(numDocs/(docFreq+1)) + 1. To review idf (Inverse Document Frequency), look at the English Wikipedia article; the Chinese version omits one sentence: "If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to use docFreq + 1." But the +1 in the denominator raises another issue: log is negative when its argument is less than 1, and since tfidf = tf * idf, a negative idf would be harmful, so 1.0 is added to the result. You might object that log tends to negative infinity, so adding 1 would not help in general; in practice, however, docFreq is at most numDocs, so numDocs/(docFreq+1) is at least 0.5, and log(0.5) + 1 ≈ 0.31 is still positive.
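To make this concrete, here is a minimal standalone sketch of the same formula (the class name IdfDemo is illustrative, not Lucene's API):

```java
// Standalone copy of the idf formula discussed above.
public class IdfDemo {
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        System.out.println(idf(1, 100));  // rare term: log(50) + 1, well above 1
        System.out.println(idf(99, 100)); // common term: log(1) + 1 = 1.0
        System.out.println(idf(1, 1));    // worst-case ratio 0.5: log(0.5) + 1, still positive
    }
}
```

Rarer terms get larger idf, and even the worst-case ratio of 0.5 keeps the result above zero, which is exactly why the +1.0 is enough.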

Now we can take another step forward and look at sumOfSquaredWeights in the Query's weight:

public float sumOfSquaredWeights() {
    queryWeight = idf * getBoost(); // compute query weight
    return queryWeight * queryWeight; // square it
}

This multiplies idf by the boost, then squares it. The next function is queryNorm:

public float queryNorm(float sumOfSquaredWeights) {
    return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
}

The last function is normalize:

public void normalize(float queryNorm) {
    this.queryNorm = queryNorm;
    queryWeight *= queryNorm; // normalize query weight
    value = queryWeight * idf; // idf for document
}
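Tracing sumOfSquaredWeights, queryNorm, and normalize numerically makes their interaction clear. The sketch below (illustrative names, not Lucene's classes) composes the three steps for a single-term query:

```java
// Composes the three weight steps above for one TermQuery.
public class WeightDemo {
    static float value(float idf, float boost) {
        float queryWeight = idf * boost;                  // computed in sumOfSquaredWeights()
        float sum = queryWeight * queryWeight;            // sumOfSquaredWeights() result
        float queryNorm = (float) (1.0 / Math.sqrt(sum)); // queryNorm()
        queryWeight *= queryNorm;                         // normalize(): becomes 1.0 here
        return queryWeight * idf;                         // value used when scoring documents
    }

    public static void main(String[] args) {
        System.out.println(value(2.5f, 1.0f)); // value == idf
        System.out.println(value(2.5f, 3.0f)); // still == idf: the norm cancels the boost
    }
}
```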

If boost = 1, you can work it out: value equals idf. In fact, for a single-term query value equals idf for any boost, because queryNorm = 1/(idf * boost) cancels the boost exactly (the boost only matters relative to other clauses in a multi-term query). We now return to the Hits constructor and look at the getMoreDocs function:

private final void getMoreDocs(int min) throws IOException {
    if (hitDocs.size() > min) {
        min = hitDocs.size();
    }

    int n = min * 2; // double # retrieved
    TopDocs topDocs = (sort == null) ? searcher.search(weight, filter, n)
            : searcher.search(weight, filter, n, sort);
    length = topDocs.totalHits;
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;

    float scoreNorm = 1.0f;

    if (length > 0 && topDocs.getMaxScore() > 1.0f) {
        scoreNorm = 1.0f / topDocs.getMaxScore();
    }

    int end = scoreDocs.length < length ? scoreDocs.length : length;
    for (int i = hitDocs.size(); i < end; i++) {
        hitDocs.addElement(new HitDoc(scoreDocs[i].score * scoreNorm, scoreDocs[i].doc));
    }
}
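Two behaviors in getMoreDocs are worth isolating: the fetch size doubles each time, and scores are rescaled so the maximum is at most 1.0. A minimal sketch of just that arithmetic (hypothetical names, not Lucene's API):

```java
// Illustrates getMoreDocs' bookkeeping: fetch twice as many hits as needed,
// and rescale scores so the maximum is at most 1.0.
public class HitsFetchDemo {
    static int nextFetchSize(int cached, int min) {
        if (cached > min) min = cached; // never request fewer than we already hold
        return min * 2;                 // double # retrieved
    }

    static float scoreNorm(float maxScore) {
        return maxScore > 1.0f ? 1.0f / maxScore : 1.0f;
    }

    public static void main(String[] args) {
        System.out.println(nextFetchSize(0, 50));   // 100
        System.out.println(nextFetchSize(100, 50)); // 200: cached size wins
        System.out.println(scoreNorm(4.0f));        // 0.25
        System.out.println(scoreNorm(0.8f));        // 1.0: already <= 1, left alone
    }
}
```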

Let's first look at the search function:

public TopDocs search(Weight weight, Filter filter, final int nDocs)
        throws IOException {

    if (nDocs <= 0) // null might be returned from hq.top() below.
        throw new IllegalArgumentException("nDocs must be > 0");

    TopDocCollector collector = new TopDocCollector(nDocs);
    search(weight, filter, collector);
    return collector.topDocs();
}

Here is the next search function:

public void search(Weight weight, Filter filter, final HitCollector results) throws IOException {
    HitCollector collector = results;
    if (filter != null) {
        final BitSet bits = filter.bits(reader);
        collector = new HitCollector() {
            public final void collect(int doc, float score) {
                if (bits.get(doc)) { // skip docs not in bits
                    results.collect(doc, score);
                }
            }
        };
    }

    Scorer scorer = weight.scorer(reader);
    if (scorer == null)
        return;
    scorer.score(collector);
}

Since we are not using a filter, that branch is simply skipped. Let's look at the scorer function:

public Scorer scorer(IndexReader reader) throws IOException {
    TermDocs termDocs = reader.termDocs(term);

    if (termDocs == null)
        return null;

    return new TermScorer(this, termDocs, similarity, reader.norms(term.field()));
}

I'll just list all of them at once to save some effort:

public TermDocs termDocs(Term term) throws IOException {
    TermDocs termDocs = termDocs();
    termDocs.seek(term);
    return termDocs;
}

public void seek(Term term) throws IOException {
    TermInfo ti = parent.tis.get(term);
    seek(ti);
}

void seek(TermInfo ti) throws IOException {
    count = 0;
    if (ti == null) {
        df = 0;
    } else {
        df = ti.docFreq;
        doc = 0;
        skipDoc = 0;
        skipCount = 0;
        numSkips = df / skipInterval;
        freqPointer = ti.freqPointer;
        proxPointer = ti.proxPointer;
        skipPointer = freqPointer + ti.skipOffset;
        freqStream.seek(freqPointer);
        haveSkipped = false;
    }
}

Here we finally see the skipInterval that we did not get to last time; numSkips records how many skip entries the posting list has.
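A tiny illustration of that arithmetic, using the skipInterval of 16 that this era of Lucene defaults to (the helper class is hypothetical):

```java
// Hypothetical helper: how many skip entries a term's posting list carries.
public class SkipDemo {
    static int numSkips(int df, int skipInterval) {
        return df / skipInterval; // one skip entry per skipInterval documents
    }

    public static void main(String[] args) {
        System.out.println(numSkips(100, 16)); // 6 skip entries for 100 postings
        System.out.println(numSkips(10, 16));  // 0: short lists have no skip entries
    }
}
```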

Next, let's look at the TermScorer constructor used in scorer:

TermScorer(Weight weight, TermDocs td, Similarity similarity, byte[] norms) {
    super(similarity);
    this.weight = weight;
    this.termDocs = td;
    this.norms = norms;
    this.weightValue = weight.getValue();

    for (int i = 0; i < SCORE_CACHE_SIZE; i++)
        scoreCache[i] = getSimilarity().tf(i) * weightValue;
}

The tf here is:

public float tf(float freq) {
    return (float) Math.sqrt(freq);
}

Once computed, scoreCache is effectively a lookup table: for freq values from 0 to SCORE_CACHE_SIZE - 1, the final tf-weighted score can be fetched directly from the table instead of being recomputed for every hit.
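The caching idea can be sketched on its own (illustrative names; SCORE_CACHE_SIZE is the same constant of 32 that TermScorer uses):

```java
// Precomputes tf(freq) * weight for small freq values, as TermScorer does,
// falling back to direct computation for freq >= cache size.
public class ScoreCacheDemo {
    static final int SCORE_CACHE_SIZE = 32;

    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float[] buildCache(float weightValue) {
        float[] cache = new float[SCORE_CACHE_SIZE];
        for (int i = 0; i < SCORE_CACHE_SIZE; i++)
            cache[i] = tf(i) * weightValue;
        return cache;
    }

    static float score(float[] cache, int freq, float weightValue) {
        return freq < SCORE_CACHE_SIZE
                ? cache[freq]              // common case: cheap table lookup
                : tf(freq) * weightValue;  // rare large freq: compute directly
    }

    public static void main(String[] args) {
        float[] cache = buildCache(2.0f);
        System.out.println(score(cache, 4, 2.0f));  // sqrt(4) * 2 = 4.0, from the table
        System.out.println(score(cache, 64, 2.0f)); // sqrt(64) * 2 = 16.0, computed directly
    }
}
```

Small term frequencies dominate real postings, so the table covers almost every call with a single array read.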

 

 
