
Koala++'s blog

Lucene Source Code Analysis [15]

2009-07-07 13:56:39 | Category: Lucene


While testing today, I found that "too many open files" really can happen. So I think it is worth looking at the .cfs file first. Change the test code to:

public static void main(String[] args) throws IOException {
    IndexWriter writer = new IndexWriter("E:\\a\\", new SimpleAnalyzer(), true);
    // writer.setUseCompoundFile(false);
    {
        Document doc1 = new Document();
        Field name1 = new Field("TheField", "hello world",
                Field.Store.YES, Field.Index.TOKENIZED);
        doc1.add(name1);

        Document doc2 = new Document();
        Field name2 = new Field("TheField", "hello china",
                Field.Store.YES, Field.Index.TOKENIZED);
        doc2.add(name2);

        Document doc3 = new Document();
        Field name3 = new Field("TheField", "hello world",
                Field.Store.YES, Field.Index.TOKENIZED);
        doc3.add(name3);

        writer.addDocument(doc1);
        writer.addDocument(doc2);
        writer.addDocument(doc3);
    }
    writer.close();
}

Let's see how the .cfs file is produced. In IndexWriter's mergeSegments there is a section we have not yet examined closely:

if (useCompoundFile) {
    final Vector filesToDelete = merger.createCompoundFile(mergedName
            + ".tmp");
    synchronized (directory) { // in- & inter-process sync
        new Lock.With(directory.makeLock(COMMIT_LOCK_NAME),
                COMMIT_LOCK_TIMEOUT) {
            public Object doBody() throws IOException {
                // make compound file visible for SegmentReaders
                directory.renameFile(mergedName + ".tmp", mergedName
                        + ".cfs");
                // delete now unused files of segment
                deleteFiles(filesToDelete);
                return null;
            }
        }.run();
    }
}
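The pattern above — build the compound file under a temporary name, then rename it inside the commit lock so readers never see a half-written .cfs — can be illustrated outside Lucene with plain java.io. This is a toy sketch of the rename-to-publish idea, not Lucene code; the class and file names are my own:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class AtomicPublish {
    // Write data under a ".tmp" name, then rename it to its final
    // ".cfs" name, so a reader scanning the directory only ever sees
    // a complete file -- the same idea as mergeSegments' rename from
    // mergedName + ".tmp" to mergedName + ".cfs" above.
    static File publish(File dir, String name, String data) throws IOException {
        File tmp = new File(dir, name + ".tmp");
        FileWriter w = new FileWriter(tmp);
        w.write(data);
        w.close();
        File fin = new File(dir, name + ".cfs");
        if (!tmp.renameTo(fin))
            throw new IOException("rename failed: " + tmp);
        return fin;
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        File f = publish(dir, "_demo", "hello");
        System.out.println(f.getName() + " exists=" + f.exists());
        f.delete();
    }
}
```

Note that File.renameTo is only atomic on the same filesystem, which is fine here since the temporary file is created in the index directory itself.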

I added three documents here. Let's look at createCompoundFile:

final Vector createCompoundFile(String fileName) throws IOException {
    CompoundFileWriter cfsWriter = new CompoundFileWriter(directory,
            fileName);

    Vector files = new Vector(IndexFileNames.COMPOUND_EXTENSIONS.length
            + fieldInfos.size());

    // Basic files
    for (int i = 0; i < IndexFileNames.COMPOUND_EXTENSIONS.length; i++) {
        files.add(segment + "." + IndexFileNames.COMPOUND_EXTENSIONS[i]);
    }

    // Field norm files
    for (int i = 0; i < fieldInfos.size(); i++) {
        FieldInfo fi = fieldInfos.fieldInfo(i);
        if (fi.isIndexed && !fi.omitNorms) {
            files.add(segment + ".f" + i);
        }
    }

    // Vector files
    if (fieldInfos.hasVectors()) {
        for (int i = 0; i < IndexFileNames.VECTOR_EXTENSIONS.length; i++) {
            files.add(segment + "." + IndexFileNames.VECTOR_EXTENSIONS[i]);
        }
    }

    // Now merge all added files
    Iterator it = files.iterator();
    while (it.hasNext()) {
        cfsWriter.addFile((String) it.next());
    }

    // Perform the merge
    cfsWriter.close();

    return files;
}

Let's check whether CompoundFileWriter's constructor does anything special:

public CompoundFileWriter(Directory dir, String name) {
    if (dir == null)
        throw new NullPointerException("directory cannot be null");
    if (name == null)
        throw new NullPointerException("name cannot be null");

    directory = dir;
    fileName = name;
    ids = new HashSet();
    entries = new LinkedList();
}

Nothing special there, so let's continue. The files vector is sized by the length of COMPOUND_EXTENSIONS plus fieldInfos.size(); here are the compound extensions:

/** File extensions of old-style index files */
static final String COMPOUND_EXTENSIONS[] = new String[] { "fnm", "frq",
        "prx", "fdx", "fdt", "tii", "tis" };

The fieldInfos part accounts for the files with extension .f(N), one per indexed field that keeps norms. The loops above put all of these names of files to be merged into the files vector. After that, it checks whether the segment also has any of the following term-vector files:

/** File extensions for term vector support */
static final String VECTOR_EXTENSIONS[] = new String[] { "tvx", "tvd",
        "tvf" };
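Putting the loops together: for our segment _3, with one indexed field that keeps norms and no term vectors, the resulting files vector holds eight names. A hand-rolled sketch of the same name assembly (my own illustration, with the extension arrays copied from above):

```java
import java.util.ArrayList;
import java.util.List;

public class CompoundFileList {
    static final String[] COMPOUND_EXTENSIONS =
        { "fnm", "frq", "prx", "fdx", "fdt", "tii", "tis" };

    // Mirrors createCompoundFile's name assembly: the basic files,
    // then one ".f" + i norm file per indexed field keeping norms.
    static List<String> filesFor(String segment, int normFields) {
        List<String> files = new ArrayList<String>();
        for (String ext : COMPOUND_EXTENSIONS)
            files.add(segment + "." + ext);
        for (int i = 0; i < normFields; i++)
            files.add(segment + ".f" + i);
        return files;
    }

    public static void main(String[] args) {
        // 8 entries: _3.fnm ... _3.tis plus _3.f0
        System.out.println(filesFor("_3", 1));
    }
}
```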

The while loop then adds each of these file names to cfsWriter:

public void addFile(String file) {
    if (merged)
        throw new IllegalStateException(
            "Can't add extensions after merge has been called");

    if (file == null)
        throw new NullPointerException(
            "file cannot be null");

    if (!ids.add(file))
        throw new IllegalArgumentException(
            "File " + file + " already added");

    FileEntry entry = new FileEntry();
    entry.file = file;
    entries.add(entry);
}

The FileEntry used here is an inner class:

private static final class FileEntry {
    /** source file */
    String file;

    /** temporary holder for the start of directory entry for this file */
    long directoryOffset;

    /** temporary holder for the start of this file's data section */
    long dataOffset;
}

Nothing remarkable here either: addFile merely saves the names of the files to merge into entries. So, as we now see, the real work has to happen in close():

public void close() throws IOException {
    if (merged)
        throw new IllegalStateException(
            "Merge already performed");

    if (entries.isEmpty())
        throw new IllegalStateException(
            "No entries to merge have been defined");

    merged = true;

    // open the compound stream
    IndexOutput os = null;
    try {
        os = directory.createOutput(fileName);

        // Write the number of entries
        os.writeVInt(entries.size());

        // Write the directory with all offsets at 0.
        // Remember the positions of directory entries so that we can
        // adjust the offsets later
        Iterator it = entries.iterator();
        while (it.hasNext()) {
            FileEntry fe = (FileEntry) it.next();
            fe.directoryOffset = os.getFilePointer();
            os.writeLong(0);    // for now
            os.writeString(fe.file);
        }

        // Open the files and copy their data into the stream.
        // Remember the locations of each file's data section.
        byte buffer[] = new byte[1024];
        it = entries.iterator();
        while (it.hasNext()) {
            FileEntry fe = (FileEntry) it.next();
            fe.dataOffset = os.getFilePointer();
            copyFile(fe, os, buffer);
        }

        // Write the data offsets into the directory of the compound stream
        it = entries.iterator();
        while (it.hasNext()) {
            FileEntry fe = (FileEntry) it.next();
            os.seek(fe.directoryOffset);
            os.writeLong(fe.dataOffset);
        }

        // Close the output stream. Set the os to null before trying to
        // close so that if an exception occurs during the close, the
        // finally clause below will not attempt to close the stream
        // the second time.
        IndexOutput tmp = os;
        os = null;
        tmp.close();

    } finally {
        if (os != null) try { os.close(); } catch (IOException e) { }
    }
}
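The write-zero-then-seek-back trick in close() is a general technique for writing a directory whose offsets aren't known yet. A minimal standalone sketch of the same three-step dance using java.io.RandomAccessFile (illustrative only; not Lucene's IndexOutput):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class PatchOffsetDemo {
    // Write a placeholder long, write the payload, then seek back
    // and patch in the real offset -- the same pattern as
    // CompoundFileWriter.close(): directoryOffset, writeLong(0),
    // copy the data, then seek(directoryOffset) and writeLong(dataOffset).
    static long writeWithPatchedOffset(RandomAccessFile f, byte[] payload)
            throws IOException {
        long dirPos = f.getFilePointer();
        f.writeLong(0L);                  // placeholder, "for now"
        long dataPos = f.getFilePointer();
        f.write(payload);                 // the data section
        long end = f.getFilePointer();
        f.seek(dirPos);
        f.writeLong(dataPos);             // patch the real offset
        f.seek(end);                      // restore position for further writes
        return dataPos;
    }

    public static void main(String[] args) throws IOException {
        java.io.File tmp = java.io.File.createTempFile("patch", ".bin");
        RandomAccessFile f = new RandomAccessFile(tmp, "rw");
        long dataPos = writeWithPatchedOffset(f, new byte[] { 1, 2, 3 });
        f.seek(0);
        System.out.println(f.readLong() == dataPos); // true
        f.close();
        tmp.delete();
    }
}
```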

The .cfs first records how many files were merged. For each file it notes its directoryOffset within the .cfs, writes a long 0 as a placeholder, and then writes the file name. The next while loop copies every file's data into the stream, recording each dataOffset. The last while loop seeks back to each directoryOffset and overwrites the placeholder 0 with the real dataOffset. Here is the .cfs laid out:

Offset      0  1  2  3  4  5  6  7   8  9  A  B  C  D  E  F

00000000   08 00 00 00 00 00 00 00  78 06 5F 33 2E 66 6E 6D   ........x._3.fnm
00000010   00 00 00 00 00 00 00 83  06 5F 33 2E 66 72 71 00   .......?_3.frq.
00000020   00 00 00 00 00 00 89 06  5F 33 2E 70 72 78 00 00   ......?_3.prx..
00000030   00 00 00 00 00 8F 06 5F  33 2E 66 64 78 00 00 00   .....?_3.fdx...
00000040   00 00 00 00 A7 06 5F 33  2E 66 64 74 00 00 00 00   ....?_3.fdt....
00000050   00 00 00 D4 06 5F 33 2E  74 69 69 00 00 00 00 00   ...?_3.tii.....
00000060   00 00 F3 06 5F 33 2E 74  69 73 00 00 00 00 00 00   ..?_3.tis......
00000070   01 28 05 5F 33 2E 66 30  01 08 54 68 65 46 69 65   .(._3.f0..TheFie
00000080   6C 64 01 03 01 03 03 01  05 01 00 00 00 01 01 00   ld..............
00000090   00 00 00 00 00 00 00 00  00 00 00 00 00 00 0F 00   ................
000000A0   00 00 00 00 00 00 1E 01  00 01 0B 68 65 6C 6C 6F   ...........hello
000000B0   20 77 6F 72 6C 64 01 00  01 0B 68 65 6C 6C 6F 20    world....hello
000000C0   63 68 69 6E 61 01 00 01  0B 68 65 6C 6C 6F 20 77   china....hello w
000000D0   6F 72 6C 64 FF FF FF FE  00 00 00 00 00 00 00 01   orld????.......
000000E0   00 00 00 80 00 00 00 10  00 00 FF FF FF FF 0F 00   ...?......????..
000000F0   00 00 14 FF FF FF FE 00  00 00 00 00 00 00 03 00   ...????........
00000100   00 00 80 00 00 00 10 00  05 63 68 69 6E 61 00 01   ..?......china..
00000110   00 00 00 05 68 65 6C 6C  6F 00 03 01 01 00 05 77   ....hello......w
00000120   6F 72 6C 64 00 02 03 03  79 79 79                  orld....yyy

The first byte is 08, meaning 8 files were merged, and all 8 names can be read out of the right-hand column. The next bytes, 00 00 00 00 00 00 00 78, say that the data of _3.fnm starts at 0x78, which is the position right after the _3.f0 entry; then the file name _3.fnm itself follows. I won't walk through the rest: generate the index once without the compound format and once with it, and compare the two.
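That leading 08 was written by os.writeVInt(entries.size()). Lucene's VInt stores 7 bits per byte, least-significant group first, with the high bit as a "more bytes follow" flag; 8 fits in a single byte. A small standalone decoder for the format (my own sketch, not Lucene's code):

```java
public class VIntDemo {
    // Decode a Lucene-style VInt starting at pos: low 7 bits of each
    // byte carry data, a set high bit means another byte follows.
    static int readVInt(byte[] buf, int pos) {
        byte b = buf[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        // The first byte of the .cfs dump: a single-byte VInt.
        System.out.println(readVInt(new byte[] { 0x08 }, 0));          // 8
        // A two-byte example: 0x85 0x01 -> 5 + (1 << 7) = 133.
        System.out.println(readVInt(new byte[] { (byte) 0x85, 1 }, 0)); // 133
    }
}
```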

 

 

 
