Koala++'s blog

Lucene Source Code Analysis [-5]

2009-07-08 22:32:50 | Category: Lucene


Now let's look at a new Query, BooleanQuery. Modify the test code as follows:

// Requires org.apache.lucene.index.Term and, from org.apache.lucene.search:
// BooleanClause, BooleanQuery, Hits, IndexSearcher, TermQuery.
public static void main( String[] args ) throws Exception
{
    // Open the index built in the earlier posts.
    IndexSearcher searcher = new IndexSearcher( "E:\\a" );

    Term t1 = new Term( "TheField", "hello" );
    Term t2 = new Term( "TheField", "world" );

    TermQuery q1 = new TermQuery( t1 );
    TermQuery q2 = new TermQuery( t2 );

    // Both terms are required, so this is "hello AND world".
    BooleanQuery query = new BooleanQuery();
    query.add( q1, BooleanClause.Occur.MUST );
    query.add( q2, BooleanClause.Occur.MUST );

    Hits hits = searcher.search( query );

    for( int i = 0; i < hits.length(); i++ )
    {
        System.out.println( hits.doc( i ) );
    }

    searcher.close();
}

Let's look at the query.add function:

public void add(Query query, BooleanClause.Occur occur) {
    add(new BooleanClause(query, occur));
}

public void add(BooleanClause clause) {
    if (clauses.size() >= maxClauseCount)
        throw new TooManyClauses();

    clauses.addElement(clause);
}

As we can see, the queries are stored in a Vector of BooleanClause objects. Here is the BooleanClause constructor:

public BooleanClause(Query query, Occur occur) {
    this.query = query;
    this.occur = occur;
    setFields(occur);
}

Occur is a nested class. Let's take a look:

/** Specifies how terms may occur in matching documents. */
public static final class Occur extends Parameter implements
        java.io.Serializable {

    /** Use this operator for terms that must appear in the matching
     * documents. */
    public static final Occur MUST = new Occur("MUST");

    /** Use this operator for terms that should appear in the
     * matching documents. For a BooleanQuery with two SHOULD
     * subqueries, at least one of the queries must appear in the matching
     * documents. */
    public static final Occur SHOULD = new Occur("SHOULD");

    /** Use this operator for terms that must not appear in the
     * matching documents. Note that it is not possible to search for
     * queries that only consist of a MUST_NOT query. */
    public static final Occur MUST_NOT = new Occur("MUST_NOT");
}

These are AND, OR, and NOT. The comments spell it out clearly, although the meaning should be obvious even without them.
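To make the three Occur values concrete, here is a minimal sketch of the matching rule they imply. It does not use Lucene: the Occur enum, the Clause class, and the matches function are hypothetical stand-ins, a document is modelled as a plain set of terms, and scoring is ignored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OccurSketch {
    // Hypothetical stand-in for BooleanClause.Occur, for illustration only.
    enum Occur { MUST, SHOULD, MUST_NOT }

    // Hypothetical clause: one term plus how it has to occur.
    static final class Clause {
        final String term;
        final Occur occur;
        Clause(String term, Occur occur) { this.term = term; this.occur = occur; }
    }

    /** Returns true if the document (a set of terms) satisfies every clause. */
    static boolean matches(Set<String> docTerms, Clause... clauses) {
        boolean hasMust = false;
        boolean hasShould = false;
        boolean anyShouldMatched = false;
        for (Clause c : clauses) {
            boolean present = docTerms.contains(c.term);
            switch (c.occur) {
                case MUST:
                    hasMust = true;
                    if (!present) return false;   // a required term is missing
                    break;
                case MUST_NOT:
                    if (present) return false;    // a prohibited term is present
                    break;
                case SHOULD:
                    hasShould = true;
                    anyShouldMatched |= present;
                    break;
            }
        }
        // With MUST clauses, SHOULD clauses are optional. Without any MUST
        // clause, at least one SHOULD clause has to match. A query made only
        // of MUST_NOT clauses matches nothing.
        return hasMust || (hasShould && anyShouldMatched);
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<String>(Arrays.asList("hello", "world"));
        System.out.println(matches(doc,
                new Clause("hello", Occur.MUST), new Clause("world", Occur.MUST)));      // true
        System.out.println(matches(doc,
                new Clause("hello", Occur.MUST), new Clause("lucene", Occur.MUST)));     // false
        System.out.println(matches(doc,
                new Clause("lucene", Occur.SHOULD), new Clause("world", Occur.SHOULD))); // true
        System.out.println(matches(doc, new Clause("lucene", Occur.MUST_NOT)));          // false
    }
}
```

The last call shows the restriction mentioned in the MUST_NOT comment: a query consisting only of prohibited clauses has nothing to select from, so it returns no documents.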

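Last time, when we stepped into Query's weight function, we saw a call to rewrite. Our Query was a TermQuery then, so rewrite changed nothing. This time let's see whether BooleanQuery behaves differently.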

public Query rewrite(IndexReader reader) throws IOException {
    if (clauses.size() == 1) { // optimize 1-clause queries
        BooleanClause c = (BooleanClause) clauses.elementAt(0);
        if (!c.isProhibited()) { // just return clause

            Query query = c.getQuery().rewrite(reader); // rewrite first

            if (getBoost() != 1.0f) { // incorporate boost
                if (query == c.getQuery()) // if rewrite was no-op
                    query = (Query) query.clone();
                query.setBoost(getBoost() * query.getBoost());
            }

            return query;
        }
    }

    BooleanQuery clone = null; // recursively rewrite
    for (int i = 0; i < clauses.size(); i++) {
        BooleanClause c = (BooleanClause) clauses.elementAt(i);
        Query query = c.getQuery().rewrite(reader);
        if (query != c.getQuery()) { // clause rewrote: must clone
            if (clone == null)
                clone = (BooleanQuery) this.clone();
            clone.clauses.setElementAt(new BooleanClause(query, c
                    .getOccur()), i);
        }
    }
    if (clone != null) {
        return clone; // some clauses rewrote
    } else
        return this; // no clauses rewrote
}

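This is again a letdown: the clauses we added to the BooleanQuery are TermQuery objects, whose rewrite returns the query itself, so no clause is rewritten and BooleanQuery's rewrite returns this unchanged.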
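The two outcomes of rewrite that matter here can be imitated with a small standalone sketch. The Query, TermQuery, and BooleanQuery classes below are hypothetical, heavily simplified stand-ins (no IndexReader, no prohibited clauses, only term clauses), so they show the shape of the logic rather than Lucene's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

public class RewriteSketch {
    /** Hypothetical base class: rewrite() returns this, as for an already primitive query. */
    static class Query {
        float boost = 1.0f;
        Query rewrite() { return this; }
    }

    static final class TermQuery extends Query {
        final String term;
        TermQuery(String term) { this.term = term; }
        TermQuery copy() {
            TermQuery q = new TermQuery(term);
            q.boost = boost;
            return q;
        }
    }

    static final class BooleanQuery extends Query {
        // Only non-prohibited TermQuery clauses are modelled in this sketch.
        final List<TermQuery> clauses = new ArrayList<TermQuery>();

        @Override
        Query rewrite() {
            if (clauses.size() == 1) {          // one clause: return the clause itself
                TermQuery q = clauses.get(0);   // rewriting a TermQuery is a no-op
                if (boost != 1.0f) {
                    q = q.copy();               // do not mutate the caller's clause
                    q.boost = boost * q.boost;  // fold the outer boost into the clause
                }
                return q;
            }
            return this;                        // no clause rewrote, so nothing changes
        }
    }

    public static void main(String[] args) {
        BooleanQuery two = new BooleanQuery();
        two.clauses.add(new TermQuery("hello"));
        two.clauses.add(new TermQuery("world"));
        System.out.println(two.rewrite() == two);      // true: same object comes back

        BooleanQuery one = new BooleanQuery();
        one.boost = 2.0f;
        TermQuery t = new TermQuery("hello");
        t.boost = 1.5f;
        one.clauses.add(t);
        Query r = one.rewrite();
        System.out.println(r != t && r.boost == 3.0f); // true: cloned, boost 2.0 * 1.5
    }
}
```

The first case is the one from the test code in this post: two term clauses, so the query comes back untouched.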

Next, let's look at the function createWeight:

protected Weight createWeight(Searcher searcher) throws IOException {

    if (0 < minNrShouldMatch) {
        // :TODO: should we throw an exception if getUseScorer14 ?
        return new BooleanWeight2(searcher);
    }

    return getUseScorer14() ? (Weight) new BooleanWeight(searcher)
            : (Weight) new BooleanWeight2(searcher);
}

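minNrShouldMatch specifies the minimum number of optional (SHOULD) clauses that must match; when it is greater than 0, BooleanWeight2 is always used. Further down, the 1.4-version scorer is not used by default, so BooleanWeight2 is what we get here. BooleanWeight2 extends BooleanWeight, whose constructor is: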

public BooleanWeight(Searcher searcher) throws IOException {
    this.similarity = getSimilarity(searcher);
    for (int i = 0; i < clauses.size(); i++) {
        BooleanClause c = (BooleanClause) clauses.elementAt(i);
        weights.add(c.getQuery().createWeight(searcher));
    }
}

Here weights is a Vector that holds the Weight of each Query in the BooleanQuery. Its sumOfSquaredWeights:

public float sumOfSquaredWeights() throws IOException {
    float sum = 0.0f;
    for (int i = 0; i < weights.size(); i++) {
        BooleanClause c = (BooleanClause) clauses.elementAt(i);
        Weight w = (Weight) weights.elementAt(i);
        if (!c.isProhibited())
            sum += w.sumOfSquaredWeights(); // sum sub weights
    }

    sum *= getBoost() * getBoost(); // boost each sub-weight

    return sum;
}

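This simply adds up the sumOfSquaredWeights of every clause that is not MUST_NOT, then multiplies the total by the square of the BooleanQuery's boost.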
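A small worked example of that arithmetic, written as a standalone sketch. The input numbers are made up for illustration, and the parallel arrays subSums and prohibited are an assumption of this sketch standing in for the weights and clauses vectors.

```java
public class SumOfSquaresSketch {
    /**
     * subSums[i]    : the value the i-th sub-weight's sumOfSquaredWeights() would return
     * prohibited[i] : true if clause i is a MUST_NOT clause
     * boost         : the boost of the enclosing BooleanQuery
     */
    static float sumOfSquaredWeights(float[] subSums, boolean[] prohibited, float boost) {
        float sum = 0.0f;
        for (int i = 0; i < subSums.length; i++) {
            if (!prohibited[i]) {
                sum += subSums[i];   // prohibited clauses contribute nothing
            }
        }
        sum *= boost * boost;        // the boost is applied squared
        return sum;
    }

    public static void main(String[] args) {
        float[] subSums = {4.0f, 9.0f, 100.0f};
        boolean[] prohibited = {false, false, true};
        // (4 + 9) * 2 * 2 = 52; the third clause is skipped.
        System.out.println(sumOfSquaredWeights(subSums, prohibited, 2.0f)); // 52.0
    }
}
```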

Now look at the weight.scorer call in IndexSearcher's search:

/** @return An alternative Scorer that uses and provides skipTo(),
 *          and scores documents in document number order.
 */
public Scorer scorer(IndexReader reader) throws IOException {
    BooleanScorer2 result = new BooleanScorer2(similarity,
            minNrShouldMatch);

    for (int i = 0; i < weights.size(); i++) {
        BooleanClause c = (BooleanClause) clauses.elementAt(i);
        Weight w = (Weight) weights.elementAt(i);
        Scorer subScorer = w.scorer(reader);
        if (subScorer != null)
            result.add(subScorer, c.isRequired(), c.isProhibited());
        else if (c.isRequired())
            return null;
    }

    return result;
}
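The Javadoc on this method mentions skipTo(), which this post does not get to. As a rough idea of why such a function helps when every clause is required, the sketch below intersects two sorted lists of document numbers by letting each side skip ahead to the other's current document. DocList and intersect are hypothetical names; this is not Lucene's implementation, and the skipTo here uses simplified semantics (it stays put if the current entry already satisfies the target), which may differ from the real Scorer contract.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipToSketch {
    /** Hypothetical iterator over a sorted array of distinct document numbers. */
    static final class DocList {
        private final int[] docs;
        private int pos = -1;

        DocList(int... docs) { this.docs = docs; }

        /** Moves to the next document; returns false when the list is exhausted. */
        boolean next() { return ++pos < docs.length; }

        /** Current document number; only valid after next() or skipTo() returned true. */
        int doc() { return docs[pos]; }

        /** Moves to the first document >= target; returns false when exhausted. */
        boolean skipTo(int target) {
            if (pos < 0) pos = 0;
            while (pos < docs.length && docs[pos] < target) pos++;
            return pos < docs.length;
        }
    }

    /** Documents present in both lists, found by leapfrogging with skipTo. */
    static List<Integer> intersect(DocList a, DocList b) {
        List<Integer> hits = new ArrayList<Integer>();
        if (!a.next() || !b.next()) return hits;
        while (true) {
            if (a.doc() == b.doc()) {
                hits.add(a.doc());
                // Advance a, then bring b up to a's new position.
                if (!a.next() || !b.skipTo(a.doc())) return hits;
            } else if (a.doc() < b.doc()) {
                if (!a.skipTo(b.doc())) return hits;
            } else {
                if (!b.skipTo(a.doc())) return hits;
            }
        }
    }

    public static void main(String[] args) {
        DocList hello = new DocList(1, 3, 5, 7, 9, 12);
        DocList world = new DocList(3, 4, 7, 12, 20);
        System.out.println(intersect(hello, world)); // [3, 7, 12]
    }
}
```

The linear scan inside skipTo is only there to keep the sketch short; the point of having a skipTo operation at all is that a real posting list can jump ahead without visiting every entry.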

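Note the comment: this scorer uses skipTo, the function we have been wanting to see but have not yet come across. There is nothing special in the method itself: it just collects the scorer of each weight. But how are they collected?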

public void add(final Scorer scorer, boolean required, boolean prohibited) {
    if (!prohibited) {
        coordinator.maxCoord++;
    }

    if (required) {
        if (prohibited) {
            throw new IllegalArgumentException(
                    "scorer cannot be required and prohibited");
        }
        requiredScorers.add(scorer);
    } else if (prohibited) {
        prohibitedScorers.add(scorer);
    } else {
        optionalScorers.add(scorer);
    }
}

Here requiredScorers, prohibitedScorers, and optionalScorers are all ArrayLists. Clearly some other function still has to process them. That happens in the last statement of IndexSearcher's search, scorer.score:

public void score(HitCollector hc) throws IOException {
    if (countingSumScorer == null) {
        initCountingSumScorer();
    }
    while (countingSumScorer.next()) {
        hc.collect(countingSumScorer.doc(), score());
    }
}

Look at initCountingSumScorer:

private void initCountingSumScorer() {
    coordinator.init();
    countingSumScorer = makeCountingSumScorer();
}

Now let's see roughly what coordinator, a Coordinator object, does:

private class Coordinator {
    int maxCoord = 0; // to be increased for each non prohibited scorer
    private float[] coordFactors = null;

    void init() { // use after all scorers have been added.
        coordFactors = new float[maxCoord + 1];
        Similarity sim = getSimilarity();
        for (int i = 0; i <= maxCoord; i++) {
            coordFactors[i] = sim.coord(i, maxCoord);
        }
    }

    int nrMatchers; // to be increased by score() of match counting scorers.

    void initDoc() {
        nrMatchers = 0;
    }

    float coordFactor() {
        return coordFactors[nrMatchers];
    }
}

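We actually saw maxCoord a moment ago but did not look at it closely. According to its comment, it counts the scorers of all non-prohibited clauses, that is, the queries added with SHOULD or MUST. The coord function in DefaultSimilarity returns the ratio of the number of matched terms to the total number of terms: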

public float coord(int overlap, int maxOverlap) {
    return overlap / (float) maxOverlap;
}
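Putting Coordinator.init and coord together, the table of coordination factors for a query can be worked out by hand. The sketch below is standalone: coordFactors is a hypothetical helper that takes one prohibited flag per clause in place of the scorers, and coord copies the DefaultSimilarity formula shown above.

```java
import java.util.Arrays;

public class CoordSketch {
    /** Same formula as the coord function quoted from DefaultSimilarity. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /**
     * Builds the factor table the way Coordinator.init() does.
     * prohibited[i] is true if clause i is MUST_NOT; such clauses do not
     * increase maxCoord.
     */
    static float[] coordFactors(boolean[] prohibited) {
        int maxCoord = 0;
        for (boolean p : prohibited) {
            if (!p) maxCoord++;
        }
        float[] factors = new float[maxCoord + 1];
        for (int i = 0; i <= maxCoord; i++) {
            factors[i] = coord(i, maxCoord);
        }
        return factors;
    }

    public static void main(String[] args) {
        // The test query in this post: two MUST clauses, so maxCoord is 2.
        float[] factors = coordFactors(new boolean[] {false, false});
        System.out.println(Arrays.toString(factors)); // [0.0, 0.5, 1.0]

        // A document matching both clauses has nrMatchers == 2.
        int nrMatchers = 2;
        System.out.println(factors[nrMatchers]);      // 1.0
    }
}
```

Precomputing the table once means that, per document, the coordination factor is a single array lookup indexed by nrMatchers.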
