While testing today, I found that the "too many open files" situation really can happen. So I think it is worth looking at the .cfs file first. Replace the test code with:
IndexWriter writer = new IndexWriter("E:\\a\\", new SimpleAnalyzer(), true);
// writer.setUseCompoundFile(false); // compound files are on by default
Document doc1 = new Document();
doc1.add(new Field("TheField", "hello world", Field.Store.YES, Field.Index.TOKENIZED));
Document doc2 = new Document();
doc2.add(new Field("TheField", "hello china", Field.Store.YES, Field.Index.TOKENIZED));
Document doc3 = new Document();
doc3.add(new Field("TheField", "hello world", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc1);
writer.addDocument(doc2);
writer.addDocument(doc3);
writer.close();
Let's see how the .cfs file is produced. There is a passage in IndexWriter's mergeSegments that we have not read carefully before:
if (useCompoundFile) {
final Vector filesToDelete = merger.createCompoundFile(mergedName
+ ".tmp");
synchronized (directory) { // in- & inter-process sync
new Lock.With(directory.makeLock(COMMIT_LOCK_NAME),
COMMIT_LOCK_TIMEOUT) {
public Object doBody() throws IOException {
// make compound file visible for SegmentReaders
directory.renameFile(mergedName + ".tmp", mergedName
+ ".cfs");
// delete now unused files of segment
deleteFiles(filesToDelete);
return null;
}
}.run();
}
}
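This write-to-.tmp-then-rename commit pattern can be sketched without Lucene using plain java.io. The class and file names below are my own for illustration, not Lucene's:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class TmpRenameDemo {
    // Write data to name + ".tmp", then rename it to name + ".cfs", so that
    // readers never observe a half-written compound file.
    static void commitWrite(File dir, String name, byte[] data) throws IOException {
        File tmp = new File(dir, name + ".tmp");
        FileOutputStream out = new FileOutputStream(tmp);
        try {
            out.write(data);
        } finally {
            out.close();
        }
        File target = new File(dir, name + ".cfs");
        if (!tmp.renameTo(target)) {
            throw new IOException("rename failed: " + tmp + " -> " + target);
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        commitWrite(dir, "_demo", new byte[] { 1, 2, 3 });
        File result = new File(dir, "_demo.cfs");
        System.out.println(result.length()); // 3 bytes were committed
        result.delete();
    }
}
```

The rename is the "commit point": until it happens, SegmentReaders only ever see either the old files or nothing, never a partially written .cfs.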
I added 3 docs here; now let's look at createCompoundFile:
final Vector createCompoundFile(String fileName) throws IOException {
CompoundFileWriter cfsWriter = new CompoundFileWriter(directory,
fileName);
Vector files = new Vector(IndexFileNames.COMPOUND_EXTENSIONS.length
+ fieldInfos.size());
// Basic files
for (int i = 0; i < IndexFileNames.COMPOUND_EXTENSIONS.length; i++)
{
files.add(segment + "."
+ IndexFileNames.COMPOUND_EXTENSIONS[i]);
}
// Field norm files
for (int i = 0; i < fieldInfos.size(); i++) {
FieldInfo fi = fieldInfos.fieldInfo(i);
if (fi.isIndexed && !fi.omitNorms) {
files.add(segment + ".f" + i);
}
}
// Vector files
if (fieldInfos.hasVectors()) {
for (int i = 0; i < IndexFileNames.VECTOR_EXTENSIONS.length; i++) {
files.add(segment + "."
+ IndexFileNames.VECTOR_EXTENSIONS[i]);
}
}
// Now merge all added files
Iterator it = files.iterator();
while (it.hasNext()) {
cfsWriter.addFile((String) it.next());
}
// Perform the merge
cfsWriter.close();
return files;
}
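To see which names end up in files, here is a minimal stand-alone sketch of that collection logic (the class and method names are my own; it assumes norms are kept and term vectors are off, as in the test above):

```java
import java.util.ArrayList;
import java.util.List;

public class CompoundFileList {
    static final String[] COMPOUND_EXTENSIONS =
            { "fnm", "frq", "prx", "fdx", "fdt", "tii", "tis" };

    // Reproduce the name-collection part of createCompoundFile for a segment
    // with the given number of indexed fields (norms kept, no vectors).
    static List<String> filesFor(String segment, int indexedFields) {
        List<String> files = new ArrayList<String>();
        for (String ext : COMPOUND_EXTENSIONS) {
            files.add(segment + "." + ext);     // the 7 basic files
        }
        for (int i = 0; i < indexedFields; i++) {
            files.add(segment + ".f" + i);      // one norms file per field
        }
        return files;
    }

    public static void main(String[] args) {
        // Our test index has one field ("TheField"), so segment _3
        // contributes 7 + 1 = 8 files, matching the 08 seen later in the dump.
        System.out.println(filesFor("_3", 1));
    }
}
```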
Let's see whether CompoundFileWriter's constructor does anything special:
public CompoundFileWriter(Directory dir, String name) {
if (dir == null)
throw new NullPointerException("directory cannot be null");
if (name == null)
throw new NullPointerException("name cannot be null");
directory = dir;
fileName = name;
ids = new HashSet();
entries = new LinkedList();
}
Nothing special, so let's continue. files is created with a capacity of COMPOUND_EXTENSIONS.length plus the number of fields; let's look at COMPOUND_EXTENSIONS:
/** File extensions of old-style index files */
static final String COMPOUND_EXTENSIONS[] = new String[] { "fnm", "frq",
"prx", "fdx", "fdt", "tii", "tis" };
plus one entry per indexed field from fieldInfos — these are the files with the .f(N) extension. The loops above put the names of all the files to be merged into files. Next it checks whether the segment has any of these term-vector files:
/** File extensions for term vector support */
static final String VECTOR_EXTENSIONS[] = new String[] { "tvx", "tvd",
"tvf" };
The while loop then adds each of these file names to cfsWriter:
public void addFile(String file) {
if (merged)
throw new IllegalStateException(
"Can't add extensions after merge has been called");
if (file == null)
throw new NullPointerException(
"file cannot be null");
if (! ids.add(file))
throw new IllegalArgumentException(
"File " + file + " already added");
FileEntry entry = new FileEntry();
entry.file = file;
entries.add(entry);
}
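The HashSet/LinkedList pairing in addFile is worth a second look: ids rejects duplicates in O(1) via the boolean return of add(), while entries preserves insertion order. A stripped-down sketch (names simplified from the original):

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class AddFileDemo {
    private final Set<String> ids = new HashSet<String>();    // duplicate check
    private final List<String> entries = new LinkedList<String>(); // keeps order

    void addFile(String file) {
        if (file == null)
            throw new NullPointerException("file cannot be null");
        if (!ids.add(file))              // Set.add returns false on a duplicate
            throw new IllegalArgumentException("File " + file + " already added");
        entries.add(file);
    }

    public static void main(String[] args) {
        AddFileDemo d = new AddFileDemo();
        d.addFile("_3.fnm");
        d.addFile("_3.frq");
        try {
            d.addFile("_3.fnm");         // duplicate: rejected
        } catch (IllegalArgumentException expected) {
            System.out.println(expected.getMessage());
        }
    }
}
```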
FileEntry here is a private static inner class:
private static final class FileEntry {
/** source file */
String file;
/** temporary holder for the start of directory entry for this file */
long directoryOffset;
/** temporary holder for the start of this file's data section */
long dataOffset;
}
Nothing interesting here either: addFile merely saves the names of the files to be merged into entries. Now it's clear that the real work must happen in close():
public void close() throws IOException {
if (merged)
throw new IllegalStateException(
"Merge already performed");
if (entries.isEmpty())
throw new IllegalStateException(
"No entries to merge have been defined");
merged = true;
// open the compound stream
IndexOutput os = null;
try {
os = directory.createOutput(fileName);
// Write the number of entries
os.writeVInt(entries.size());
// Write the directory with all offsets at 0.
// Remember the positions of directory entries so that we can
// adjust the offsets later
Iterator it = entries.iterator();
while(it.hasNext()) {
FileEntry fe = (FileEntry) it.next();
fe.directoryOffset = os.getFilePointer();
os.writeLong(0); // for now
os.writeString(fe.file);
}
// Open the files and copy their data into the stream.
// Remember the locations of each file's data section.
byte buffer[] = new byte[1024];
it = entries.iterator();
while(it.hasNext()) {
FileEntry fe = (FileEntry) it.next();
fe.dataOffset = os.getFilePointer();
copyFile(fe, os, buffer);
}
// Write the data offsets into the directory of the compound stream
it = entries.iterator();
while(it.hasNext()) {
FileEntry fe = (FileEntry) it.next();
os.seek(fe.directoryOffset);
os.writeLong(fe.dataOffset);
}
// Close the output stream. Set the os to null before trying to
// close so that if an exception occurs during the close, the
// finally clause below will not attempt to close the stream
// the second time.
IndexOutput tmp = os;
os = null;
tmp.close();
} finally {
if (os != null) try { os.close(); } catch (IOException e) { }
}
}
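The heart of close() is the two-pass placeholder trick: write the directory with zero offsets, append the data while remembering where each file's section starts, then seek back and patch the zeros. Here is a stand-alone sketch using RandomAccessFile instead of Lucene's IndexOutput (the names are my own; writeUTF only approximates Lucene's writeString, and the single count byte stands in for a VInt):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class TwoPassWriter {
    static void writeCompound(File out, String[] names, byte[][] data) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(out, "rw");
        try {
            raf.setLength(0);
            raf.writeByte(names.length);        // entry count
            // Pass 1: directory with placeholder offsets.
            long[] dirOffsets = new long[names.length];
            for (int i = 0; i < names.length; i++) {
                dirOffsets[i] = raf.getFilePointer();
                raf.writeLong(0L);              // "for now", as the comment says
                raf.writeUTF(names[i]);
            }
            // Pass 2: append each file's data, remembering where it starts.
            long[] dataOffsets = new long[names.length];
            for (int i = 0; i < names.length; i++) {
                dataOffsets[i] = raf.getFilePointer();
                raf.write(data[i]);
            }
            // Pass 3: seek back and replace each 0 with the real offset.
            for (int i = 0; i < names.length; i++) {
                raf.seek(dirOffsets[i]);
                raf.writeLong(dataOffsets[i]);
            }
        } finally {
            raf.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".cfs");
        writeCompound(f, new String[] { "_3.fnm" }, new byte[][] { { 1, 2, 3 } });
        RandomAccessFile in = new RandomAccessFile(f, "r");
        in.readByte();                          // count = 1
        long offset = in.readLong();            // the patched data offset
        System.out.println(in.readUTF() + " @ " + offset); // prints "_3.fnm @ 17"
        in.close();
        f.delete();
    }
}
```

The single seek-back pass is what lets the writer stream each source file once without buffering it all in memory.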
The .cfs writer first writes the number of files being merged. Then for each file it records the current position as directoryOffset, writes a placeholder 0, and writes the file name. The next while loop copies every file's contents into the stream, recording each one's dataOffset. The last while loop seeks back to each directoryOffset and overwrites the earlier 0 with the real dataOffset. Here is the resulting .cfs:
Offset 0 1 2 3 4 5 6 7 8 9 A B C D E F
00000000 08 00 00 00 00 00 00 00 78 06 5F 33 2E 66 6E 6D ........x._3.fnm
00000010 00 00 00 00 00 00 00 83 06 5F 33 2E 66 72 71 00 .......?_3.frq.
00000020 00 00 00 00 00 00 89 06 5F 33 2E 70 72 78 00 00 ......?_3.prx..
00000030 00 00 00 00 00 8F 06 5F 33 2E 66 64 78 00 00 00 .....?_3.fdx...
00000040 00 00 00 00 A7 06 5F 33 2E 66 64 74 00 00 00 00 ....?_3.fdt....
00000050 00 00 00 D4 06 5F 33 2E 74 69 69 00 00 00 00 00 ...?_3.tii.....
00000060 00 00 F3 06 5F 33 2E 74 69 73 00 00 00 00 00 00 ..?_3.tis......
00000070 01 28 05 5F 33 2E 66 30 01 08 54 68 65 46 69 65 .(._3.f0..TheFie
00000080 6C 64 01 03 01 03 03 01 05 01 00 00 00 01 01 00 ld..............
00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0F 00 ................
000000A0 00 00 00 00 00 00 1E 01 00 01 0B 68 65 6C 6C 6F ...........hello
000000B0 20 77 6F 72 6C 64 01 00 01 0B 68 65 6C 6C 6F 20 world....hello
000000C0 63 68 69 6E 61 01 00 01 0B 68 65 6C 6C 6F 20 77 china....hello w
000000D0 6F 72 6C 64 FF FF FF FE 00 00 00 00 00 00 00 01 orld????.......
000000E0 00 00 00 80 00 00 00 10 00 00 FF FF FF FF 0F 00 ...?......????..
000000F0 00 00 14 FF FF FF FE 00 00 00 00 00 00 00 03 00 ...????........
00000100 00 00 80 00 00 00 10 00 05 63 68 69 6E 61 00 01 ..?......china..
00000110 00 00 00 05 68 65 6C 6C 6F 00 03 01 01 00 05 77 ....hello......w
00000120 6F 72 6C 64 00 02 03 03 79 79 79 orld....yyy
Look at the first byte: 08, meaning 8 files were merged, and all 8 names can be counted in the ASCII column on the right. The next 8 bytes, 00 00 00 00 00 00 00 78, mean the data of _3.fnm starts at offset 0x78, which is the position just after the _3.f0 directory entry; then the file name _3.fnm itself follows. I won't go through the rest: generate the index once without compound files and once with them, and compare the two.
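The directory header in the dump can be decoded mechanically: a VInt count, then for each file an 8-byte big-endian offset and a length-prefixed name. A small hand-rolled parser (my own sketch; the ASCII-only string read is a shortcut, since Lucene's writeString uses modified UTF-8):

```java
public class CfsHeaderParser {
    static int pos; // read cursor into the byte array

    // Lucene VInt: 7 data bits per byte, high bit = continuation.
    // Values below 128 fit in one byte, so 0x08 really is the count 8.
    static int readVInt(byte[] b) {
        int value = 0, shift = 0;
        byte cur;
        do {
            cur = b[pos++];
            value |= (cur & 0x7F) << shift;
            shift += 7;
        } while ((cur & 0x80) != 0);
        return value;
    }

    static long readLong(byte[] b) {
        long v = 0;
        for (int i = 0; i < 8; i++) v = (v << 8) | (b[pos++] & 0xFF);
        return v;
    }

    static String readString(byte[] b) {
        int len = readVInt(b);                  // VInt length prefix
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) sb.append((char) b[pos++]); // ASCII-only
        return sb.toString();
    }

    public static void main(String[] args) {
        // First 16 bytes of the hex dump: count, offset of _3.fnm, its name.
        byte[] header = { 0x08, 0, 0, 0, 0, 0, 0, 0, 0x78,
                          0x06, '_', '3', '.', 'f', 'n', 'm' };
        pos = 0;
        System.out.println(readVInt(header));   // prints 8 (files merged)
        System.out.println(readLong(header));   // prints 120 (0x78)
        System.out.println(readString(header)); // prints _3.fnm
    }
}
```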