注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

Koala++'s blog

计算广告学 RTB

 
 
 

日志

 
 

Lucene 2.4.1 源代码分析 DefaultSkipListWriter  

2009-09-24 09:32:01|  分类: Lucene |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |

       还是应该把DefaultSkipListWriter挑出来,另起一篇,这次以类以主,先读一下Javen的说明,例如,如果DocFreq=35并且SkipInterval=16,则在TermFreqs中有两个SkipData记录,容纳第15和第31个文档编号。第一个FreqSkip代表第16SkipDatumn起始的TermFreqs数据开始之后的字节数目,第二个FreqSkip表示第32SkipDatumn开始之后的字节数目。第一个ProxSkip代表第16SkipDatumn起始的Positions数据开始之后的字节数目,第二个ProxSkip表示第32SkipDatumn开始之后的字节数目。

  在Lucene 2.2版本中介绍了skip levels的想法(notion),每一个term可以有多个skip levels。一个termskip levels的数目等于NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq/log(SkipInterval))))。对一个skip level来说SkipData记录的数目等于DocFreq/(SkipInterval^(Level + 1))。然而(whereas)最低的(lowestskip level等于Level = 0

  例如假设SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35,则skip level 08SkipData记录,在TermFreqs序列中包含第37111519232731个文档的编号。Skip level 1则有2SkipData记录,在TermFreqs中包含了第15和第31个文档的编号。

在所有level>0之上的SkipData记录中包含一个SkipChildLevelPointer,指向(referencinglevel-1中相应)(corresponding)的SkipData记录。在这个例子中,level 1中的记录15有一个指针指向level 0中的记录15level 1中的记录31有一个指针指向level 0中的记录31

DefaultSkipListWriter继承自MultiLevelListWriter类。先看MultiLevelListWriter类。它是一个抽象类,有三个成员变量:

// number of levels in this skip list

private int numberOfSkipLevels;

 

// the skip interval in the list with level = 0

private int skipInterval;

 

// for every skip level a different buffer is used

private RAMOutputStream[] skipBuffer;

       三个都是最核心的,skip List的级数,listlevel=0的时候的skip interval,对于每一个skip level有一个不同的buffer。它的构造函数:

protected MultiLevelSkipListWriter(int skipInterval, int maxSkipLevels,

       int df) {

    this.skipInterval = skipInterval;

 

    // calculate the maximum number of skip levels for

// this document frequency

    numberOfSkipLevels = df == 0 ? 0 : (int) Math.floor(Math.log(df)

           / Math.log(skipInterval));

 

    // make sure it does not exceed maxSkipLevels

    if (numberOfSkipLevels > maxSkipLevels) {

       numberOfSkipLevels = maxSkipLevels;

    }

}

       三个参数,skipInterval,最大的级数,文档次数。级数是log(df)/log(skipInterval),如果它大于指定的最大级数,则要等于最大级数。

protected void init() {

    skipBuffer = new RAMOutputStream[numberOfSkipLevels];

    for (int i = 0; i < numberOfSkipLevels; i++) {

       skipBuffer[i] = new RAMOutputStream();

    }

}

       这里就印证了上面所说的每一级对应一个元素的话,这里一个元素是RAMOutputStream

protected void resetSkip() {

    // creates new buffers or empties the existing ones

    if (skipBuffer == null) {

       init();

    } else {

       for (int i = 0; i < skipBuffer.length; i++) {

           skipBuffer[i].reset();

       }

    }

}

       这个没什么好讲的,无则init,有则reset

/**

 * Subclasses must implement the actual skip data encoding in this method.

 **/

protected abstract void writeSkipData(int level, IndexOutput skipBuffer)

       throws IOException;

       注释写到子类必须实现真正的skip data 写入在这个方法中。

/**

 * Writes the current skip data to the buffers. The current

* document frequency determines

 * the max level is skip data is to be written to.

*/

void bufferSkip(int df) throws IOException {

    int numLevels;

 

    // determine max level

    for (numLevels = 0; (df % skipInterval) == 0

           && numLevels < numberOfSkipLevels; df /= skipInterval) {

       numLevels++;

    }

 

    long childPointer = 0;

 

    for (int level = 0; level < numLevels; level++) {

       writeSkipData(level, skipBuffer[level]);

 

       long newChildPointer = skipBuffer[level].getFilePointer();

 

       if (level != 0) {

           // store child pointers for all levels except the lowest

           skipBuffer[level].writeVLong(childPointer);

       }

 

       //remember the childPointer for the next level

       childPointer = newChildPointer;

    }

}

       这里是确定有多少级,因为要是skipInterval的倍数的时候,级数是以skipInterval为底的对数,这里换了一种写法,就是乘多少次小于numberOfSkipLevels,并且还要余skipInterval0。函数writeSkipData写入跳表数据,而newChildPointer是现在这个level的指位置,如果不是第一级,写入它的子指针,并为下一级记录。

/**

 * Writes the buffered skip lists to the given output.

*/

long writeSkip(IndexOutput output) throws IOException {

    long skipPointer = output.getFilePointer();

    if (skipBuffer == null || skipBuffer.length == 0)

       return skipPointer;

 

    for (int level = numberOfSkipLevels - 1; level > 0; level--) {

       long length = skipBuffer[level].getFilePointer();

       if (length > 0) {

           output.writeVLong(length);

           skipBuffer[level].writeTo(output);

       }

    }

    skipBuffer[0].writeTo(output);

 

    return skipPointer;

}

       这个函数是写入buffered skip list,它是倒着写入的,先写入每一级的长度,再写入这一级的数据。最后写入第一级的数据。

       DefaultSkipListWriter的成员变量如下:

private int[] lastSkipDoc;

private int[] lastSkipPayloadLength;

private long[] lastSkipFreqPointer;

private long[] lastSkipProxPointer;

private IndexOutput freqOutput;

private IndexOutput proxOutput;

private int curDoc;

private boolean curStorePayloads;

private int curPayloadLength;

private long curFreqPointer;

private long curProxPointer;

       没有什么好讲的,都从名字可以看出来。它的构造函数如下:

DefaultSkipListWriter(int skipInterval, int numberOfSkipLevels,

       int docCount, IndexOutput freqOutput, IndexOutput proxOutput) {

    super(skipInterval, numberOfSkipLevels, docCount);

    this.freqOutput = freqOutput;

    this.proxOutput = proxOutput;

 

    lastSkipDoc = new int[numberOfSkipLevels];

    lastSkipPayloadLength = new int[numberOfSkipLevels];

    lastSkipFreqPointer = new long[numberOfSkipLevels];

    lastSkipProxPointer = new long[numberOfSkipLevels];

}

       这里lastSkipDoclastSkipPayloadLengthlastSkipFreqPointerlastSkipProxPointer也是每级有一个。

/**

 * Sets the values for the current skip data.

 */

void setSkipData(int doc, boolean storePayloads, int payloadLength) {

    this.curDoc = doc;

    this.curStorePayloads = storePayloads;

    this.curPayloadLength = payloadLength;

    this.curFreqPointer = freqOutput.getFilePointer();

    if (proxOutput != null)

       this.curProxPointer = proxOutput.getFilePointer();

}

       这个就是参数中的内容设为当前的。

protected void resetSkip() {

    super.resetSkip();

    Arrays.fill(lastSkipDoc, 0);

    // we don't have to write the first length in the skip list

    Arrays.fill(lastSkipPayloadLength, -1);

    Arrays.fill(lastSkipFreqPointer, freqOutput.getFilePointer());

    if (proxOutput != null)

       Arrays.fill(lastSkipProxPointer, proxOutput.getFilePointer());

}

       重置,也没什么好看的。下面把writeSkipData的注释先列出来:

// To efficiently store payloads in the posting lists we do

// not store the length of every payload. Instead we omit the

// length for a payload if  the previous payload had the same length.

// 为了使在posting lists保存payloads更有效,我们不保存每次payload的长度

// 取而代之的是对于一个payload如果和以前的payload有相同的长我们则省略

// However, in order to support skipping the payload length at

// every skip point  must be known. So we use the same length

// encoding that we use for the posting lists for the skip data as well:

// 但是,为了支持忽略payload长度对于每一个skip point必须知道,所以我们用相同

// 的长度编码

// 那么我们对于skip data 也用poasting list

// Case 1: current field does not store payloads

//           SkipDatum                 --> DocSkip, FreqSkip, ProxSkip

//           DocSkip,FreqSkip,ProxSkip --> VInt

//           DocSkip records the document number before every SkipInterval

//             th  document in TermFreqs.

//           Document numbers are represented as differences from the

//             previous value in the sequence.

// Case 1: 当前field 不保存payload

//           SkipDatum                 --> DocSkip, FreqSkip, ProxSkip

//           DocSkip,FreqSkip,ProxSkip --> VInt

//           DocSkip记录在TermFreqs中在每个skipInterval之前的document

//           Document号用与序列中以前值的差来表示

// Case 2: current field stores payloads

//           SkipDatum   --> DocSkip, PayloadLength?, FreqSkip,ProxSkip

//           DocSkip,FreqSkip,ProxSkip --> VInt

//           PayloadLength             --> VInt   

//         In this case DocSkip/2 is the difference between

//         the current and the previous value. If DocSkip

//         is odd, then a PayloadLength encoded as VInt follows,

//         if DocSkip is even, then it is assumed that the

//         current payload length equals the length at the previous

//         skip point

// Case 2: 当前field保存payload

//           SkipDatum    --> DocSkip, PayloadLength?, FreqSkip,ProxSkip

//           DocSkip,FreqSkip,ProxSkip --> VInt

//           PayloadLength             --> VInt  

//        在这种情况个 DocSkip/2是当前与前值的差,如果DocSkip是奇数

//        那么PayloadLength被编码为VInt,如果DocSkip是偶数

//        那么它假设当前的payload长度与前skip point的长度相等

       看完注释之后,现在开始看代码:

protected void writeSkipData(int level, IndexOutput skipBuffer)

       throws IOException {

    if (curStorePayloads) {

       int delta = curDoc - lastSkipDoc[level];

       if (curPayloadLength == lastSkipPayloadLength[level]) {

           // the current payload length equals the length

// at the previous skip point,

           // so we don't store the length again

           skipBuffer.writeVInt(delta * 2);

       } else {

           // the payload length is different from the previous one.

// We shift the DocSkip,

           // set the lowest bit and store the current payload

// length as VInt.

           skipBuffer.writeVInt(delta * 2 + 1);

           skipBuffer.writeVInt(curPayloadLength);

           lastSkipPayloadLength[level] = curPayloadLength;

       }

    } else {

       // current field does not store payloads

       skipBuffer.writeVInt(curDoc - lastSkipDoc[level]);

    }

    skipBuffer.writeVInt((int) (curFreqPointer

       lastSkipFreqPointer[level]));

    skipBuffer.writeVInt((int) (curProxPointer

       lastSkipProxPointer[level]));

 

    lastSkipDoc[level] = curDoc;

    //System.out.println("write doc at level " + level + ": " + curDoc);

 

    lastSkipFreqPointer[level] = curFreqPointer;

    lastSkipProxPointer[level] = curProxPointer;

}

       先看一下简单的,就是不保存payload的代码,即else中的代码,skipBuffer写入的是它与上一个skipDoc之间的差值,以VInt写入,而保存payload的代码,计算两个skip Doc之间的差值,已经说过了,可以和else中的合一起写,而如果当前的payload长度与前一个payload的长度一样,就不用写入长度,只是左移一位,也就是注释上写的:值的二分之一是文档间的差值,当然它是偶数,而下面是要写入长度的代码,左移1位后,将最右移置1,然后记录下现在payload的长度。

       再回顾一下FreqProxTermsWriter调用它的代码:

skipListWriter.resetSkip();

 

// Now termStates has numToMerge FieldMergeStates

// which all share the same term. Now we must

// interleave the docID streams.

while (numToMerge > 0) {

 

    if ((++df % skipInterval) == 0) {

       skipListWriter.setSkipData(lastDoc,

              currentFieldStorePayloads, lastPayloadLength);

       skipListWriter.bufferSkip(df);

    }

*********************************************************************

long skipPointer = skipListWriter.writeSkip(freqOut);

 

       先要重置Skip,如果dfskipInterval的倍数,那么设置Skip Data,再用bufferSkip写入RAM,最后用writeSkip写入文件。

 

 

 

 

 

  评论这张
 
阅读(1224)| 评论(0)
推荐 转载

历史上的今天

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2017