
Koala++'s blog

Lucene Source Code Analysis [Field]

2009-07-04 19:11:40 | Category: Lucene


    The Field class has three static nested classes: Store, Index, and TermVector.

    The code for Store is as follows:

public static final class Store extends Parameter implements Serializable {

    private Store(String name) {
       super(name);
    }

    public static final Store COMPRESS = new Store("COMPRESS");

    public static final Store YES = new Store("YES");

    public static final Store NO = new Store("NO");
}

    The Store class specifies whether and how a field should be stored. Store.NO means the field value is not stored in the index. Store.YES stores the original field value in the index. According to the Javadoc, this is useful for short text, such as a document's title, that should be displayed with the results. The value is stored in its original form, i.e. no analyzer is applied before it is stored. Store.COMPRESS stores the field value in the index in compressed form, which is useful for long documents and binary fields.

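    One point is worth noting. Compression is usually thought of as a way to save disk space, but it does considerably more than that. Here is a loosely translated passage from Introduction to Information Retrieval.

    One benefit of compression is obvious: we need less disk space. Compression ratios of 1:4 are easy to achieve, which cuts the cost of storing the index by up to 75%. There are two more subtle benefits. The first is better use of caching: with compression, more information fits in main memory. The second is faster transfer of data from disk to memory. On current hardware, with an efficient decompression algorithm, reading a compressed chunk from disk and then decompressing it usually takes less total time than reading the same chunk uncompressed.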
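    To make the space argument concrete, here is a minimal sketch using the JDK's java.util.zip.Deflater and Inflater. It only illustrates compressing a value and reading it back. The class name CompressDemo, the helper methods, and the sample text are all made up for this example, and this is not the code path Lucene runs for Store.COMPRESS.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class CompressDemo {

    // Compress a byte array with DEFLATE at the best-compression level.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restore the original bytes from the compressed form.
    static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // Stop instead of spinning if the input is truncated.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break;
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Hypothetical sample: a long, repetitive text, which compresses well.
    static String sampleText() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("Lucene stores the original field value in the index. ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = sampleText().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        byte[] restored = decompress(packed);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("round trip ok:    " + Arrays.equals(raw, restored));
    }
}
```

    The sample text is deliberately repetitive, so the ratio it shows is much better than what real documents would give. The point is only that the stored form is smaller and can be restored exactly.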

    The code for Index is as follows:

public static final class Index extends Parameter implements Serializable {

    private Index(String name) {
       super(name);
    }

    public static final Index NO = new Index("NO");

    public static final Index TOKENIZED = new Index("TOKENIZED");

    public static final Index UN_TOKENIZED = new Index("UN_TOKENIZED");

    public static final Index NO_NORMS = new Index("NO_NORMS");

}

    The Index class specifies whether and how a field should be indexed. Index.NO means the field value is not indexed, so the field cannot be searched, but its contents can still be retrieved provided it was stored with Store.YES. There is no Index.YES, because every other choice already implies that the field is indexed. Index.TOKENIZED indexes the field's value so that it can be searched: an Analyzer tokenizes, and possibly further normalizes, the text before its terms are stored in the index. This is useful for ordinary text. Index.UN_TOKENIZED indexes the field's value without an Analyzer, so the value is stored as a single term. This is useful for unique IDs such as product numbers. Index.NO_NORMS indexes the field's value without an Analyzer and also disables the storing of norms. Without norms, index-time boosting and field length normalization are disabled. The benefit is lower memory usage, because norms take up one byte per indexed field for every document in the index.
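    The difference between TOKENIZED and UN_TOKENIZED can be sketched without Lucene at all. In the sketch below, a whitespace split plus lowercasing stands in for a real Analyzer (an assumption made for brevity; Lucene's analyzers do more than this), and the class name, method names, and sample values are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class IndexModeDemo {

    // Stand-in for an Analyzer: split on whitespace and lowercase each token.
    static List<String> tokenized(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // No analysis: the whole value becomes exactly one term, unchanged.
    static List<String> unTokenized(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        // A title is ordinary text, so each word becomes a searchable term.
        System.out.println(tokenized("Lucene in Action"));
        // A hypothetical product number is kept whole as a single term.
        System.out.println(unTokenized("SKU 0042-A"));
    }
}
```

    With the un-tokenized form, only a query for the exact value matches. A query for a fragment such as "sku" would find nothing, which is usually what you want for identifiers.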

    The code for TermVector is as follows:

public static final class TermVector extends Parameter implements Serializable {

    private TermVector(String name) {
       super(name);
    }

    /** Do not store term vectors.
     */
    public static final TermVector NO = new TermVector("NO");

    /** Store the term vectors of each document. A term vector is a list
     * of the document's terms and their number of occurrences in that
     * document. */
    public static final TermVector YES = new TermVector("YES");

    /**
     * Store the term vector + token position information
     */
    public static final TermVector WITH_POSITIONS = new TermVector(
           "WITH_POSITIONS");

    /**
     * Store the term vector + Token offset information
     */
    public static final TermVector WITH_OFFSETS = new TermVector(
           "WITH_OFFSETS");

    /**
     * Store the term vector + Token position and offset information
     */
    public static final TermVector WITH_POSITIONS_OFFSETS = new
         TermVector("WITH_POSITIONS_OFFSETS");
}

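    TermVector specifies whether a term vector is used and what information it stores. I am not entirely sure what term vectors are for. From a quick search, one possible use is this: since a search has to specify keywords, adding a TermVector to a Field would let that Field be found during retrieval. (The Javadoc in the code above is more concrete: it describes a term vector as a list of the document's terms and their number of occurrences in that document.)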
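    Going only by the Javadoc quoted above (a list of a document's terms with their occurrence counts, optionally with positions and offsets), the term vector for one field of one document could look roughly like the sketch below. The whitespace tokenizer and every name in it are assumptions for illustration. This is not Lucene's TermVectorsWriter and says nothing about the on-disk format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TermVectorDemo {

    // Occurrence data for one term within a single field value.
    static final class Entry {
        final List<Integer> positions = new ArrayList<>();    // token index, starting at 0
        final List<Integer> startOffsets = new ArrayList<>(); // char offset of the first char
        final List<Integer> endOffsets = new ArrayList<>();   // char offset one past the last char

        int freq() {
            return positions.size();
        }
    }

    // Build a sorted term -> occurrences map for a single field value.
    static Map<String, Entry> build(String text) {
        Map<String, Entry> vector = new TreeMap<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                Entry e = vector.computeIfAbsent(text.substring(start, i), k -> new Entry());
                e.positions.add(position++);
                e.startOffsets.add(start);
                e.endOffsets.add(i);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Entry> me : build("to be or not to be").entrySet()) {
            Entry e = me.getValue();
            System.out.println(me.getKey() + " freq=" + e.freq()
                    + " positions=" + e.positions
                    + " start=" + e.startOffsets
                    + " end=" + e.endOffsets);
        }
    }
}
```

    In these terms, TermVector.YES would correspond to keeping only the terms and freq(), WITH_POSITIONS adds the positions list, WITH_OFFSETS adds the two offset lists, and WITH_POSITIONS_OFFSETS keeps everything.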

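    We have already come across it in the writePostings function of DocumentWriter, although I did not discuss it at the time. There it is written to disk through termVectorWriter, which produces three files: .tvx, .tvd, and .tvf. Here is the relevant part of writePostings again: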

// check to see if we switched to a new field
String termField = posting.term.field();
if (currentField != termField) {
    // changing field - see if there is something to save
    currentField = termField;
    FieldInfo fi = fieldInfos.fieldInfo(currentField);
    if (fi.storeTermVector) {
       if (termVectorWriter == null) {
           termVectorWriter = new TermVectorsWriter(directory,
                  segment, fieldInfos);
           termVectorWriter.openDocument();
       }
       termVectorWriter.openField(currentField);

    } else if (termVectorWriter != null) {
       termVectorWriter.closeField();
    }
}
if (termVectorWriter != null && termVectorWriter.isFieldOpen()) {
    termVectorWriter.addTerm(posting.term.text(), postingFreq,
           posting.positions, posting.offsets);
}

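    This code runs after the .tii, .tis, .frq, and .prx data has been written. It checks whether the field has changed and then saves the current FieldInfo, so one can more or less guess that its purpose is to make field information available at query time.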
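    One detail in the snippet above is easy to miss: currentField != termField compares object references, not characters. That is only safe when equal field names are guaranteed to be the same String object, which is what interning provides (the Field constructor carries the comment "field names are interned"). A small sketch of that property, using a made-up class name:

```java
class InternDemo {

    // True when both references point to the very same String object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "title";          // string literals are interned
        String copy = new String("title"); // a distinct object with equal content

        System.out.println(sameObject(literal, copy));          // false
        System.out.println(literal.equals(copy));               // true
        System.out.println(sameObject(literal, copy.intern())); // true
    }
}
```

    A reference comparison is cheaper than equals(), which presumably matters in a loop that runs once per posting.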

    The Field constructor is as follows:

public Field(String name, String value, Store store, Index index,
       TermVector termVector) {
    if (name == null)
       throw new NullPointerException("name cannot be null");
    if (value == null)
       throw new NullPointerException("value cannot be null");
    if (index == Index.NO && store == Store.NO)
       throw new IllegalArgumentException(
              "it doesn't make sense to have a field that "
                     + "is neither indexed nor stored");
    if (index == Index.NO && termVector != TermVector.NO)
        throw new IllegalArgumentException(
              "cannot store term vector information "
                     + "for a field that is not indexed");

    this.name = name.intern(); // field names are interned
    this.fieldsData = value;

    if (store == Store.YES) {
       this.isStored = true;
       this.isCompressed = false;
    } else if (store == Store.COMPRESS) {
       this.isStored = true;
       this.isCompressed = true;
    } else if (store == Store.NO) {
       this.isStored = false;
       this.isCompressed = false;
    } else
       throw new IllegalArgumentException("unknown store parameter "
              + store);

    if (index == Index.NO) {
       this.isIndexed = false;
       this.isTokenized = false;
    } else if (index == Index.TOKENIZED) {
       this.isIndexed = true;
       this.isTokenized = true;
    } else if (index == Index.UN_TOKENIZED) {
       this.isIndexed = true;
       this.isTokenized = false;
    } else if (index == Index.NO_NORMS) {
       this.isIndexed = true;
       this.isTokenized = false;
       this.omitNorms = true;
    } else {
       throw new IllegalArgumentException("unknown index parameter "
              + index);
    }

    this.isBinary = false;

    setStoreTermVector(termVector);
}

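    The checks at the start of the constructor are worth reading, especially the third and the fourth. As for Store and Index, we had already more or less guessed how they behave, and the code simply confirms it.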
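    The third and fourth checks can be tried out in isolation. The sketch below copies just those two conditions into a standalone class. Plain Java enums stand in for Lucene's Parameter-based constants, and the class name and the isValid helper are made up, so treat it as an illustration of the rules and not as Lucene's API.

```java
class FieldChecksDemo {

    // Simplified stand-ins for Field.Store, Field.Index and Field.TermVector.
    enum Store { YES, NO, COMPRESS }
    enum Index { NO, TOKENIZED, UN_TOKENIZED, NO_NORMS }
    enum TermVector { NO, YES }

    // Mirrors the third and fourth checks of the constructor shown above.
    static void validate(Store store, Index index, TermVector termVector) {
        if (index == Index.NO && store == Store.NO) {
            throw new IllegalArgumentException(
                    "it doesn't make sense to have a field that "
                            + "is neither indexed nor stored");
        }
        if (index == Index.NO && termVector != TermVector.NO) {
            throw new IllegalArgumentException(
                    "cannot store term vector information "
                            + "for a field that is not indexed");
        }
    }

    // Demo helper: report whether a combination passes both checks.
    static boolean isValid(Store store, Index index, TermVector termVector) {
        try {
            validate(store, index, termVector);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Neither indexed nor stored: rejected by the third check.
        System.out.println(isValid(Store.NO, Index.NO, TermVector.NO));
        // Term vector requested for a field that is not indexed: rejected by the fourth check.
        System.out.println(isValid(Store.YES, Index.NO, TermVector.YES));
        // Stored only, no term vector: accepted.
        System.out.println(isValid(Store.YES, Index.NO, TermVector.NO));
        // Indexed but not stored, with a term vector: accepted.
        System.out.println(isValid(Store.NO, Index.TOKENIZED, TermVector.YES));
    }
}
```

    Both rules say the same thing from two sides: a field has to be useful for something (searching or retrieval), and a term vector only makes sense for a field that actually goes through indexing.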

 

 
