
Koala++'s blog

Lucene Source Code Analysis [12]

2009-07-05 09:21:50 | Category: Lucene


Let's look at these one by one:

// No compound file exists - use the multi-file format
fieldInfos = new FieldInfos(cfsDir, segment + ".fnm");
fieldsReader = new FieldsReader(cfsDir, segment, fieldInfos);

In our example, cfsDir is a RAMDirectory. Let's look at the FieldInfos constructor:

FieldInfos(Directory d, String name) throws IOException {
    IndexInput input = d.openInput(name);
    try {
       read(input);
    } finally {
       input.close();
    }
}

// Directory.openInput:
public IndexInput openInput(String name) throws IOException {
    // default implementation for back compatibility
    // this method should be abstract
    return (IndexInput) openFile(name);
}

// RAMDirectory overrides openInput:
public final IndexInput openInput(String name) {
    RAMFile file = (RAMFile) files.get(name);
    return new RAMInputStream(file);
}

I won't over-explain these; a quick look is enough for now. Back to the read function:

private void read(IndexInput input) throws IOException {
    int size = input.readVInt(); // read in the size
    for (int i = 0; i < size; i++) {
       String name = input.readString().intern();
       byte bits = input.readByte();
       boolean isIndexed = (bits & IS_INDEXED) != 0;
       boolean storeTermVector = (bits & STORE_TERMVECTOR) != 0;
       boolean storePositionsWithTermVector = (bits &
           STORE_POSITIONS_WITH_TERMVECTOR) != 0;
       boolean storeOffsetWithTermVector = (bits &
           STORE_OFFSET_WITH_TERMVECTOR) != 0;
       boolean omitNorms = (bits & OMIT_NORMS) != 0;

       addInternal(name, isIndexed, storeTermVector,
              storePositionsWithTermVector,
              storeOffsetWithTermVector,
              omitNorms);
    }
}

    This corresponds exactly to the write function (an obvious thing to say, perhaps). read first reads how many Fields there are, then loops, reading each Field's name and its settings. The write function packs the settings with OR operations, and the read function tests them with AND operations. We have already seen both write and addInternal; I list them below without further comment.

public void write(IndexOutput output) throws IOException {
    output.writeVInt(size());
    for (int i = 0; i < size(); i++) {
       FieldInfo fi = fieldInfo(i);
       byte bits = 0x0;
       if (fi.isIndexed)
           bits |= IS_INDEXED;
       if (fi.storeTermVector)
           bits |= STORE_TERMVECTOR;
       if (fi.storePositionWithTermVector)
           bits |= STORE_POSITIONS_WITH_TERMVECTOR;
       if (fi.storeOffsetWithTermVector)
           bits |= STORE_OFFSET_WITH_TERMVECTOR;
       if (fi.omitNorms)
           bits |= OMIT_NORMS;
       output.writeString(fi.name);
       output.writeByte(bits);
    }
}

private void addInternal(String name, boolean isIndexed,
       boolean storeTermVector, boolean storePositionWithTermVector,
       boolean storeOffsetWithTermVector, boolean omitNorms) {
    FieldInfo fi = new FieldInfo(name, isIndexed, byNumber.size(),
           storeTermVector, storePositionWithTermVector,
           storeOffsetWithTermVector, omitNorms);
    byNumber.add(fi);
    byName.put(name, fi);
}
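The OR/AND symmetry between write and read can be checked with a standalone round trip. This is just a sketch: the flag values mirror the constants used above, but they are assumptions here, not taken from the Lucene source.

```java
public class FieldBitsDemo {
    // Flag values mirroring the FieldInfos constants above (assumed values)
    static final byte IS_INDEXED = 0x1;
    static final byte STORE_TERMVECTOR = 0x2;
    static final byte OMIT_NORMS = 0x10;

    // Write side: OR the individual flags into one byte
    static byte pack(boolean indexed, boolean termVector, boolean omitNorms) {
        byte bits = 0;
        if (indexed) bits |= IS_INDEXED;
        if (termVector) bits |= STORE_TERMVECTOR;
        if (omitNorms) bits |= OMIT_NORMS;
        return bits;
    }

    public static void main(String[] args) {
        byte bits = pack(true, false, true);
        // Read side: AND each flag back out
        System.out.println((bits & IS_INDEXED) != 0);        // true
        System.out.println((bits & STORE_TERMVECTOR) != 0);  // false
        System.out.println((bits & OMIT_NORMS) != 0);        // true
    }
}
```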

    Next is the FieldsReader constructor:

FieldsReader(Directory d, String segment, FieldInfos fn) throws IOException {
    fieldInfos = fn;

    fieldsStream = d.openInput(segment + ".fdt");
    indexStream = d.openInput(segment + ".fdx");

    size = (int) (indexStream.length() / 8);
}
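The division by 8 works because each document contributes exactly one long (8 bytes) to the .fdx file, holding that document's start offset in .fdt. A minimal sketch of the same arithmetic, using plain java.io streams in place of Lucene's IndexInput/IndexOutput (the offsets are made up):

```java
import java.io.*;

public class FdxDemo {
    // Simulate writing the .fdx file: one long per document
    static byte[] writeFdx(long[] fdtOffsets) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream fdx = new DataOutputStream(bos);
        for (long off : fdtOffsets)
            fdx.writeLong(off);
        return bos.toByteArray();
    }

    // Look up document n: seek to n * 8 and read one long,
    // just like indexStream.seek(n * 8L) followed by readLong()
    static long fdtPosition(byte[] fdx, int n) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(fdx));
        in.skipBytes(n * 8);
        return in.readLong();
    }

    public static void main(String[] args) throws IOException {
        byte[] fdx = writeFdx(new long[] {0L, 17L, 60L, 112L});
        int size = fdx.length / 8;    // fixed-width records: length / 8 = doc count
        System.out.println("size = " + size);                  // size = 4
        System.out.println("doc 2 at " + fdtPosition(fdx, 2)); // doc 2 at 60
    }
}
```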

    Before going through FieldsReader's doc function, here is the FieldsWriter addDocument function we have already seen:

final void addDocument(Document doc) throws IOException {
    indexStream.writeLong(fieldsStream.getFilePointer());

    int storedCount = 0;
    Enumeration fields = doc.fields();
    while (fields.hasMoreElements()) {
        Field field = (Field) fields.nextElement();
        if (field.isStored())
            storedCount++;
    }
    fieldsStream.writeVInt(storedCount);

    fields = doc.fields();
    while (fields.hasMoreElements()) {
        Field field = (Field) fields.nextElement();
        if (field.isStored()) {
            fieldsStream.writeVInt(fieldInfos.fieldNumber(field.name()));

            byte bits = 0;
            if (field.isTokenized())
                bits |= FieldsWriter.FIELD_IS_TOKENIZED;
            if (field.isBinary())
                bits |= FieldsWriter.FIELD_IS_BINARY;
            if (field.isCompressed())
                bits |= FieldsWriter.FIELD_IS_COMPRESSED;

            fieldsStream.writeByte(bits);

            if (field.isCompressed()) {
              // compression is enabled for the current field
              byte[] data = null;
              // check if it is a binary field
              if (field.isBinary()) {
                data = compress(field.binaryValue());
              }
              else {
                data = compress(field.stringValue().getBytes("UTF-8"));
              }
              final int len = data.length;
              fieldsStream.writeVInt(len);
              fieldsStream.writeBytes(data, len);
            }
            else {
              // compression is disabled for the current field
              if (field.isBinary()) {
                byte[] data = field.binaryValue();
                final int len = data.length;
                fieldsStream.writeVInt(len);
                fieldsStream.writeBytes(data, len);
              }
              else {
                fieldsStream.writeString(field.stringValue());
              }
            }
        }
    }
}

    Let's recap: .fdx first records the current position of the .fdt file pointer; .fdt then writes how many Fields are Store.YES, and loops writing each Field's name number and its storage flags; finally, depending on whether the field is compressed, binary, or plain, the value is written in the corresponding format. Now let's see how FieldsReader reads this back.

    .fdx first seeks to position n*8L, because each .fdx entry is a long (note that writeLong was used, not writeVLong). It then reads document n's position in .fdt, and .fdt seeks to that position. Mirroring the write side, it first reads how many Fields were stored with Store.YES, then reads each Field's number. The code below looks messy, but it too corresponds exactly to the writer.

final Document doc(int n) throws IOException {
    indexStream.seek(n * 8L);
    long position = indexStream.readLong();
    fieldsStream.seek(position);

    Document doc = new Document();
    int numFields = fieldsStream.readVInt();
    for (int i = 0; i < numFields; i++) {
       int fieldNumber = fieldsStream.readVInt();
       FieldInfo fi = fieldInfos.fieldInfo(fieldNumber);

       byte bits = fieldsStream.readByte();

       boolean compressed = (bits & FieldsWriter.FIELD_IS_COMPRESSED) != 0;
       boolean tokenize = (bits & FieldsWriter.FIELD_IS_TOKENIZED) != 0;

       if ((bits & FieldsWriter.FIELD_IS_BINARY) != 0) {
           final byte[] b = new byte[fieldsStream.readVInt()];
           fieldsStream.readBytes(b, 0, b.length);
           if (compressed)
              doc.add(new Field(fi.name, uncompress(b),
                     Field.Store.COMPRESS));
           else
              doc.add(new Field(fi.name, b, Field.Store.YES));
       } else {
           Field.Index index;
           Field.Store store = Field.Store.YES;

           if (fi.isIndexed && tokenize)
              index = Field.Index.TOKENIZED;
           else if (fi.isIndexed && !tokenize)
              index = Field.Index.UN_TOKENIZED;
           else
              index = Field.Index.NO;

           Field.TermVector termVector = null;
           if (fi.storeTermVector) {
              if (fi.storeOffsetWithTermVector) {
                  if (fi.storePositionWithTermVector) {
                     termVector = Field.TermVector.WITH_POSITIONS_OFFSETS;
                  } else {
                     termVector = Field.TermVector.WITH_OFFSETS;
                  }
              } else if (fi.storePositionWithTermVector) {
                  termVector = Field.TermVector.WITH_POSITIONS;
              } else {
                  termVector = Field.TermVector.YES;
              }
           } else {
              termVector = Field.TermVector.NO;
           }

           if (compressed) {
              store = Field.Store.COMPRESS;
              final byte[] b = new byte[fieldsStream.readVInt()];
              fieldsStream.readBytes(b, 0, b.length);
              Field f = new Field(fi.name, // field name
                     new String(uncompress(b), "UTF-8"),
                     store, index, termVector);
              f.setOmitNorms(fi.omitNorms);
              doc.add(f);
           } else {
              Field f = new Field(fi.name, // name
                     fieldsStream.readString(), // read value
                     store, index, termVector);
              f.setOmitNorms(fi.omitNorms);
              doc.add(f);
           }
       }
    }

    return doc;
}

    Compression and decompression are done with the standard Java classes Deflater and Inflater. On to the code:

private final byte[] compress(byte[] input) {

    // Create the compressor with highest level of compression
    Deflater compressor = new Deflater();
    compressor.setLevel(Deflater.BEST_COMPRESSION);

    // Give the compressor the data to compress
    compressor.setInput(input);
    compressor.finish();

    /*
     * Create an expandable byte array to hold the compressed data.
     * You cannot use an array that's the same size as the original because
     * there is no guarantee that the compressed data will be smaller than
     * the uncompressed data.
     */
    ByteArrayOutputStream bos = new ByteArrayOutputStream(input.length);

    // Compress the data
    byte[] buf = new byte[1024];
    while (!compressor.finished()) {
       int count = compressor.deflate(buf);
       bos.write(buf, 0, count);
    }

    compressor.end();

    // Get the compressed data
    return bos.toByteArray();
}

private final byte[] uncompress(final byte[] input) throws IOException {

    Inflater decompressor = new Inflater();
    decompressor.setInput(input);

    // Create an expandable byte array to hold the decompressed data
    ByteArrayOutputStream bos = new ByteArrayOutputStream(input.length);

    // Decompress the data
    byte[] buf = new byte[1024];
    while (!decompressor.finished()) {
       try {
           int count = decompressor.inflate(buf);
           bos.write(buf, 0, count);
       } catch (DataFormatException e) {
           // this will happen if the field is not compressed
           throw new IOException("field data are in wrong format: "
                  + e.toString());
       }
    }

    decompressor.end();

    // Get the decompressed data
    return bos.toByteArray();
}

    As the comment mentions, the highest level of compression is used. Deflater offers other levels as well, which are worth a look if you are interested.
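A quick sketch of how the level affects output size, using only java.util.zip; the input data is arbitrary, and the helper mirrors the compress loop shown above:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class DeflateLevels {
    // Same drain loop as the compress method above, but with a chosen level
    static int compressedSize(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream(input.length);
        byte[] buf = new byte[1024];
        while (!d.finished())
            bos.write(buf, 0, d.deflate(buf));
        d.end();
        return bos.size();
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++)
            data[i] = (byte) (i % 7);   // repetitive data, compresses well

        // BEST_SPEED = 1, BEST_COMPRESSION = 9
        System.out.println("level 1: " + compressedSize(data, Deflater.BEST_SPEED));
        System.out.println("level 9: " + compressedSize(data, Deflater.BEST_COMPRESSION));
    }
}
```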

    Also, binary values are saved with the writeBytes function, which we have not looked at yet; it ultimately calls writeBytes in BufferedIndexOutput:

public void writeBytes(byte[] b, int length) throws IOException {
    int bytesLeft = BUFFER_SIZE - bufferPosition;
    // is there enough space in the buffer?
    if (bytesLeft >= length) {
       // we add the data to the end of the buffer
       System.arraycopy(b, 0, buffer, bufferPosition, length);
       bufferPosition += length;
       // if the buffer is full, flush it
       if (BUFFER_SIZE - bufferPosition == 0)
           flush();
    } else {
       // is data larger than the buffer?
       if (length > BUFFER_SIZE) {
           // we flush the buffer
           if (bufferPosition > 0)
              flush();
           // and write data at once
           flushBuffer(b, length);
       } else {
           // we fill/flush the buffer (until the input is written)
           int pos = 0; // position in the input data
           int pieceLength;
           while (pos < length) {
              pieceLength = (length - pos < bytesLeft) ? length - pos
                     : bytesLeft;
              System.arraycopy(b, pos, buffer, bufferPosition,
                     pieceLength);
              pos += pieceLength;
              bufferPosition += pieceLength;
              // if the buffer is full, flush it
              bytesLeft = BUFFER_SIZE - bufferPosition;
              if (bytesLeft == 0) {
                  flush();
                  bytesLeft = BUFFER_SIZE;
              }
           }
       }
    }
}

    The first if needs no explanation: the buffer can still hold everything in b, so the data is simply appended. The inner if in the else handles the case where b is longer than the buffer itself, so there is no point in buffering; the data is written out directly. The while loop in the remaining branch executes at most twice, so don't let it confuse you: pieceLength simply means that if there is more to write than the remaining space, a partial piece is written first; otherwise everything is written at once.
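The three branches can be traced with a small standalone model of the buffer. BUFFER_SIZE is shrunk to 8 so every case shows up, and flushes are recorded in a list instead of being written to a file; all names here are illustrative, not Lucene's:

```java
import java.util.ArrayList;
import java.util.List;

public class MiniBufferedOutput {
    static final int BUFFER_SIZE = 8;
    byte[] buffer = new byte[BUFFER_SIZE];
    int bufferPosition = 0;
    List<Integer> flushSizes = new ArrayList<>();  // records each flush

    void flush() {                                 // empty the buffer
        flushSizes.add(bufferPosition);
        bufferPosition = 0;
    }

    void flushBuffer(byte[] b, int length) {       // direct write, no buffering
        flushSizes.add(length);
    }

    void writeBytes(byte[] b, int length) {
        int bytesLeft = BUFFER_SIZE - bufferPosition;
        if (bytesLeft >= length) {                 // case 1: fits in the buffer
            System.arraycopy(b, 0, buffer, bufferPosition, length);
            bufferPosition += length;
            if (bufferPosition == BUFFER_SIZE)
                flush();
        } else if (length > BUFFER_SIZE) {         // case 2: larger than buffer
            if (bufferPosition > 0)
                flush();
            flushBuffer(b, length);
        } else {                                   // case 3: fill, flush, fill
            int pos = 0;
            while (pos < length) {                 // runs at most twice
                int piece = Math.min(length - pos, bytesLeft);
                System.arraycopy(b, pos, buffer, bufferPosition, piece);
                pos += piece;
                bufferPosition += piece;
                bytesLeft = BUFFER_SIZE - bufferPosition;
                if (bytesLeft == 0) {
                    flush();
                    bytesLeft = BUFFER_SIZE;
                }
            }
        }
    }

    public static void main(String[] args) {
        MiniBufferedOutput out = new MiniBufferedOutput();
        out.writeBytes(new byte[5], 5);    // case 1: buffered, no flush yet
        out.writeBytes(new byte[5], 5);    // case 3: 3 bytes fill + flush, 2 buffered
        out.writeBytes(new byte[20], 20);  // case 2: flush the 2, then direct 20
        System.out.println(out.flushSizes);  // [8, 2, 20]
    }
}
```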

 
