
Koala++'s blog

Lucene Source Code Analysis [6]

2009-07-03 12:40:57 | Category: Lucene

// invert doc into postingTable

postingTable.clear(); // clear postingTable

fieldLengths = new int[fieldInfos.size()]; // init fieldLengths

fieldPositions = new int[fieldInfos.size()]; // init fieldPositions

fieldOffsets = new int[fieldInfos.size()]; // init fieldOffsets

 

fieldBoosts = new float[fieldInfos.size()]; // init fieldBoosts

Arrays.fill(fieldBoosts, doc.getBoost());

 

invertDocument(doc);

    The lines above are explained by their trailing comments; the interesting call is the last one, invertDocument. The method is written as one long while loop and is hard to break apart, so it is shown whole below.

// Tokenizes the fields of a document into Postings.

private final void invertDocument(Document doc) throws IOException {

    Enumeration fields = doc.fields();

    while (fields.hasMoreElements()) {

       Field field = (Field) fields.nextElement();

       String fieldName = field.name();

       int fieldNumber = fieldInfos.fieldNumber(fieldName);

 

       int length = fieldLengths[fieldNumber]; // length of field

       int position = fieldPositions[fieldNumber]; // position in field

       if (length > 0)

           position += analyzer.getPositionIncrementGap(fieldName);

       int offset = fieldOffsets[fieldNumber]; // offset field

 

       if (field.isIndexed()) {

           if (!field.isTokenized()) { // un-tokenized field

              String stringValue = field.stringValue();

              if (field.isStoreOffsetWithTermVector())

                  addPosition(fieldName, stringValue, position++,

                         new TermVectorOffsetInfo(offset, offset

                                + stringValue.length()));

              else

                  addPosition(fieldName, stringValue, position++,

                          null);

              offset += stringValue.length();

              length++;

           } else {

              Reader reader; // find or make Reader

              if (field.readerValue() != null)

                  reader = field.readerValue();

              else if (field.stringValue() != null)

                  reader = new StringReader(field.stringValue());

              else

                  throw new IllegalArgumentException(

                     "field must have either String or Reader value");

 

              // Tokenize field and add to postingTable

              TokenStream stream = analyzer.tokenStream(fieldName, reader);

              try {

                  Token lastToken = null;

                  for (Token t = stream.next(); t != null; t = stream.next()) {

                     position += (t.getPositionIncrement() - 1);

 

                     if (field.isStoreOffsetWithTermVector())

                         addPosition(fieldName, t.termText(),

                                position++, new TermVectorOffsetInfo(

                                       offset + t.startOffset(),

                                       offset + t.endOffset()));

                     else

                         addPosition(fieldName, t.termText(),

                                position++, null);

 

                     lastToken = t;

                     if (++length > maxFieldLength) {

                         if (infoStream != null)

                            infoStream.println("maxFieldLength "

                                          + maxFieldLength

                                          + " reached, ignoring following tokens");

                         break;

                     }

                  }

 

                  if (lastToken != null)

                     offset += lastToken.endOffset() + 1;

 

              } finally {

                  stream.close();

              }

           }

 

           fieldLengths[fieldNumber] = length; // save field length

           fieldPositions[fieldNumber] = position; // save field position

           fieldBoosts[fieldNumber] *= field.getBoost();

           fieldOffsets[fieldNumber] = offset;

       }

    }

}

    The method loops over all Fields of the document. For each Field it first looks up the name, the field number, and the accumulated length, position, and offset for that field. It then checks whether the Field is indexed; only indexed fields are processed, since a field that is not indexed needs no inverting.
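The per-field bookkeeping can be sketched in isolation. This is a simplified illustration, not Lucene code: the `fieldNumbers` map stands in for `fieldInfos.fieldNumber()`, and the plain arrays mirror `fieldLengths` / `fieldPositions` / `fieldOffsets`. The point it shows is that a document may contain several values for the same field name, and they all accumulate into the same slot, indexed by field number.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldStateDemo {
    public static void main(String[] args) {
        // hypothetical field names; fieldNumbers plays the role of fieldInfos
        String[] fieldNames = {"title", "body"};
        Map<String, Integer> fieldNumbers = new HashMap<>();
        for (int i = 0; i < fieldNames.length; i++)
            fieldNumbers.put(fieldNames[i], i);

        // one slot per field, as in DocumentWriter
        int[] fieldLengths = new int[fieldNames.length];

        // two values of the same field accumulate into the same slot
        int n = fieldNumbers.get("body");
        fieldLengths[n] += 3; // first "body" value contributed 3 tokens
        fieldLengths[n] += 2; // second "body" value contributed 2 tokens
        System.out.println(fieldLengths[n]);
    }
}
```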

    Now the next if: the field is not tokenized (we can ignore the TermVector handling for the moment). Look at the addPosition call: this function adds a term to postingTable. Since the Field value is not tokenized here, the entire string is added as a single term.
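The bookkeeping in the un-tokenized branch can be traced by hand. The sketch below uses made-up values and plain Java, no Lucene classes, but follows the same arithmetic as the code above: each whole value consumes exactly one position, the offset advances by the value's character length, and length counts values rather than tokens.

```java
public class UntokenizedDemo {
    public static void main(String[] args) {
        // two hypothetical values for the same un-tokenized field
        String[] values = {"red", "blue"};
        int position = 0, offset = 0, length = 0;
        for (String v : values) {
            // the whole value is one term: addPosition(field, v, position++, ...)
            System.out.println(v + " pos=" + position + " offset=" + offset);
            position++;              // one position per value
            offset += v.length();    // offset += stringValue.length()
            length++;                // length counts values, not tokens
        }
        System.out.println("length=" + length + " offset=" + offset);
    }
}
```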

    The else branch adds tokenization. We first obtain the Field's value; if you look inside Field, its member fieldsData is commented as "the one and only data object for all different kind of field values", which is why its type has to be checked first. The code then tokenizes the value with tokenStream, updating position, length, and offset as it goes.
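The position arithmetic in the tokenized branch is worth tracing: `position += (t.getPositionIncrement() - 1)` followed by `position++` means each token advances the position by exactly its increment, so a stop-word filter that reports an increment of 2 leaves a one-position hole. A minimal sketch with made-up increments (no Lucene classes, just the same arithmetic):

```java
public class PositionDemo {
    public static void main(String[] args) {
        // increments as an analyzer might emit them: a stop-word filter
        // reports increment 2 after dropping a token
        int[] increments = {1, 1, 2, 1};
        int position = 0;
        StringBuilder recorded = new StringBuilder();
        for (int inc : increments) {
            position += (inc - 1);                 // as in invertDocument
            recorded.append(position).append(' '); // position passed to addPosition
            position++;                            // the position++ in the call
        }
        System.out.println(recorded.toString().trim());
    }
}
```

Note the hole: the third token lands at position 3, not 2, because the dropped stop word still "occupies" a position.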

private final void addPosition(String field, String text, int position,

       TermVectorOffsetInfo offset) {

    termBuffer.set(field, text);

    //System.out.println("Offset: " + offset);

    Posting ti = (Posting) postingTable.get(termBuffer);

    if (ti != null) { // word seen before

       int freq = ti.freq;

       if (ti.positions.length == freq) { // positions array is full

           int[] newPositions = new int[freq * 2]; // double size

           int[] positions = ti.positions;

           for (int i = 0; i < freq; i++)

              // copy old positions to new

              newPositions[i] = positions[i];

           ti.positions = newPositions;

       }

       ti.positions[freq] = position; // add new position

 

       if (offset != null) {

           if (ti.offsets.length == freq) {

              TermVectorOffsetInfo[] newOffsets = new TermVectorOffsetInfo[freq * 2];

              TermVectorOffsetInfo[] offsets = ti.offsets;

              for (int i = 0; i < freq; i++) {

                  newOffsets[i] = offsets[i];

              }

              ti.offsets = newOffsets;

           }

           ti.offsets[freq] = offset;

       }

       ti.freq = freq + 1; // update frequency

    } else { // word not seen before

       Term term = new Term(field, text, false);

       postingTable.put(term, new Posting(term, position, offset));

    }

}

    termBuffer is a Term object; we first set its field and text, then check whether this term already exists in postingTable. The first if handles a term we have seen before: we read its current frequency freq, and if the positions array is full we double its capacity before appending the new position. Offsets are handled the same way as positions. If we have never seen this term, a new Posting is created to record its information.
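The array growth in addPosition is the classic amortized-doubling idiom. A stand-alone sketch of just that part (using Arrays.copyOf in place of the manual copy loop; the behavior is the same):

```java
import java.util.Arrays;

public class GrowDemo {
    static int[] positions = new int[1]; // Posting starts with capacity 1
    static int freq = 0;

    static void addPosition(int position) {
        if (positions.length == freq) {                  // array is full
            positions = Arrays.copyOf(positions, freq * 2); // double its size
        }
        positions[freq++] = position;
    }

    public static void main(String[] args) {
        for (int p : new int[]{0, 1, 3, 4, 9}) addPosition(p);
        // capacity grew 1 -> 2 -> 4 -> 8 while 5 positions were stored
        System.out.println(freq + " " + positions.length);
    }
}
```

Doubling keeps the total copying cost linear in the number of appends, which matters because addPosition is called once per token.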

    The Posting class is listed below; there is nothing special about it.

final class Posting { // info about a Term in a doc

    Term term; // the Term

    int freq; // its frequency in doc

    int[] positions; // positions it occurs at

    TermVectorOffsetInfo[] offsets;

 

    Posting(Term t, int position, TermVectorOffsetInfo offset) {

       term = t;

       freq = 1;

       positions = new int[1];

       positions[0] = position;

       if (offset != null) {

           offsets = new TermVectorOffsetInfo[1];

           offsets[0] = offset;

       } else

           offsets = null;

    }

}

 

 

 

 
