
Lucene Source Code Analysis

Posted on: October 21, 2021 at 01:01 PM

Lucene splits into two parts: writing the index and searching it. This walkthrough debugs both in turn.

VInt

A VInt is a variable-length, little-endian integer encoding: in each byte, a high bit of 1 means more bytes follow (so the last byte has its high bit set to 0).
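A minimal sketch of the scheme (it mirrors what DataOutput.writeVInt / DataInput.readVInt do; the helper names here are mine):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class VIntDemo {
  // 7 payload bits per byte, least-significant group first;
  // a set high bit means another byte follows.
  static void writeVInt(ByteArrayOutputStream out, int i) {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80); // low 7 bits plus continuation bit
      i >>>= 7;
    }
    out.write(i); // final byte: high bit clear
  }

  static int readVInt(ByteArrayInputStream in) {
    int b = in.read();
    int value = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = in.read();
      value |= (b & 0x7F) << shift;
    }
    return value;
  }

  public static void main(String[] args) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVInt(out, 300); // encodes as 0xAC 0x02
    System.out.println(readVInt(new ByteArrayInputStream(out.toByteArray()))); // 300
  }
}
```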

Related code:

      IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
      iwc.setUseCompoundFile(false);  // write separate index files instead of a compound file
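For context, a minimal indexing setup around that config might look like this sketch (paths and field names are illustrative, not the demo's exact code):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexDemo {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setUseCompoundFile(false); // keep .fdt, .fnm, .tim ... as separate files
    try (FSDirectory dir = FSDirectory.open(Paths.get("index"));
         IndexWriter writer = new IndexWriter(dir, iwc)) {
      Document doc = new Document();
      doc.add(new StringField("path", "/home/ubuntu/doc/hello.txt", Field.Store.YES));
      writer.addDocument(doc); // this is the call the breakpoints below land under
    }
  }
}
```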

Starting the debug session

### Debugging the Java code

 java -agentlib:jdwp=transport=dt_socket,server=y,address=8000 -cp  ./lucene-demo-9.1.0-SNAPSHOT.jar:/home/ubuntu/lucene-9.1.0/lucene/core/build/libs/lucene-core-9.1.0-SNAPSHOT.jar:/home/ubuntu/lucene-9.1.0/lucene/queryparser/build/libs/lucene-queryparser-9.1.0-SNAPSHOT.jar    org.apache.lucene.demo.SearchFiles

### Attaching jdb to the JVM
jdb -attach 8000 -sourcepath /home/ubuntu/lucene-9.1.0/lucene/demo/src/java/

Inspecting the .fdt file. The first four bytes, 3f d7 6c 17, are CodecUtil's codec magic, followed by the codec name string "Lucene90StoredFieldsFastData":

hexdump -C _0.fdt
00000000  3f d7 6c 17 1c 4c 75 63  65 6e 65 39 30 53 74 6f  |?.l..Lucene90Sto|
00000010  72 65 64 46 69 65 6c 64  73 46 61 73 74 44 61 74  |redFieldsFastDat|
00000020  61 00 00 00 01 85 88 12  2b 0c 73 6b 95 30 38 76  |a.......+.sk.08v|
00000030  c9 0a 2a 52 29 00 00 0a  00 01 00 1c 02 06 03 07  |..*R)...........|
00000040  07 07 07 07 07 07 07 07  20 00 1a 60 2f 68 6f 6d  |........ ..`/hom|
00000050  65 2f 60 75 62 75 6e 74  75 60 2f 64 6f 63 2f 6d  |e/`ubuntu`/doc/m|
00000060  60 6f 6e 67 6f 2e 74 60  78 74 00 1a 2f 68 60 6f  |`ongo.t`xt../h`o|
00000070  6d 65 2f 75 62 60 75 6e  74 75 2f 64 60 6f 63 2f  |me/ub`untu/d`oc/|
00000080  68 65 6c 60 6c 6f 2e 74  78 74 c0 28 93 e8 00 00  |hel`lo.txt.(....|
00000090  00 00 00 00 00 00 c8 75  0a 41                    |.......u.A|
0000009a
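As a sanity check, the header can be verified programmatically with CodecUtil; a sketch, assuming the index directory and segment file names seen above:

```java
import java.nio.file.Paths;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.MMapDirectory;

public class HeaderDemo {
  public static void main(String[] args) throws Exception {
    try (MMapDirectory dir = new MMapDirectory(Paths.get("index"));
         IndexInput in = dir.openInput("_0.fdt", IOContext.READONCE)) {
      // Reads the 3f d7 6c 17 magic, the codec name, and the version int
      int version = CodecUtil.checkHeader(in, "Lucene90StoredFieldsFastData", 0, Integer.MAX_VALUE);
      System.out.println("codec header version: " + version);
    }
  }
}
```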

writeField

ubuntu@VM-0-3-ubuntu:~$ jdb -attach 8000 -sourcepath /home/ubuntu/lucene-9.1.0/lucene/demo/src/java/:/home/ubuntu/lucene-9.1.0/lucene/core/src/java/ 
Set uncaught java.lang.Throwable
Set deferred uncaught java.lang.Throwable
Initializing jdb ...
> 
VM Started: No frames on the current call stack

main[1] stop in  org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField
Deferring breakpoint org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField.
It will be set after the class is loaded.
main[1] cont
> Set deferred breakpoint org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField

Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField(), line=276 bci=0
276        ++numStoredFieldsInDoc;

main[1] where
  [1] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField (Lucene90CompressingStoredFieldsWriter.java:276)
  [2] org.apache.lucene.index.StoredFieldsConsumer.writeField (StoredFieldsConsumer.java:65)
  [3] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:749)
  [4] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
  [5] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
  [6] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
  [7] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
  [8] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
  [9] org.apache.lucene.index.IndexWriter.addDocument (IndexWriter.java:1,469)
  [10] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:271)
  [11] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
  [12] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
  [13] java.nio.file.Files.walkFileTree (Files.java:2,725)
  [14] java.nio.file.Files.walkFileTree (Files.java:2,797)
  [15] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
  [16] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)
main[1] list
272    
273      @Override
274      public void writeField(FieldInfo info, IndexableField field) throws IOException {
275    
276 =>     ++numStoredFieldsInDoc;
277    
278        int bits = 0;
279        final BytesRef bytes;
280        final String string;
281    
main[1] print field
 field = "stored,indexed,omitNorms,indexOptions=DOCS<path:/home/ubuntu/doc/mongo.txt>"
main[1] print info
 info = "org.apache.lucene.index.FieldInfo@32464a14"

Tokenization and the inverted index

main[1] where
  [1] org.apache.lucene.index.IndexingChain$PerField.invert (IndexingChain.java:1,138)
  [2] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:729)
  [3] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
  [4] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
  [5] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
  [6] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
  [7] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
  [8] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:277)
  [9] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
  [10] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
  [11] java.nio.file.Files.walkFileTree (Files.java:2,725)
  [12] java.nio.file.Files.walkFileTree (Files.java:2,797)
  [13] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
  [14] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)
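PerField.invert in the stack above consumes the TokenStream the analyzer produces for the field; a standalone sketch of that tokenization step (field name and text are illustrative):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeDemo {
  public static void main(String[] args) throws Exception {
    try (Analyzer analyzer = new StandardAnalyzer();
         TokenStream ts = analyzer.tokenStream("contents", "Hello mongo text")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString()); // hello, mongo, text
      }
      ts.end();
    }
  }
}
```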

How a term is represented

      IntBlockPool intPool,
      ByteBlockPool bytePool,
      ByteBlockPool termBytePool,

In memory, an inverted-index term is described by the pools above. The intPool itself contains three variables:

So what does buffers[xxx][yyy] hold? This two-dimensional buffers array also stores offsets. Offsets into what?

The intPool stores offsets into the bytePool and the termBytePool.
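A toy model of that indirection, to make it concrete (names and sizes are illustrative, not Lucene's actual block-sliced layout):

```java
public class PoolSketch {
  final byte[] bytePool = new byte[1 << 15]; // the actual posting bytes
  final int[] intPool = new int[128];        // per-term write cursor (offset) into bytePool

  // Append one byte to a term's posting stream: the intPool slot tells us
  // where in bytePool that stream currently ends.
  void append(int termSlot, byte b) {
    bytePool[intPool[termSlot]++] = b;
  }
}
```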

Writing terms to the .tim file

The terms are written one by one:

main[1] where 
  [1] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlock (Lucene90BlockTreeTermsWriter.java:963)
  [2] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlocks (Lucene90BlockTreeTermsWriter.java:709)
  [3] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.finish (Lucene90BlockTreeTermsWriter.java:1,105)
  [4] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write (Lucene90BlockTreeTermsWriter.java:370)
  [5] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write (PerFieldPostingsFormat.java:171)
  [6] org.apache.lucene.index.FreqProxTermsWriter.flush (FreqProxTermsWriter.java:131)
  [7] org.apache.lucene.index.IndexingChain.flush (IndexingChain.java:300)
  [8] org.apache.lucene.index.DocumentsWriterPerThread.flush (DocumentsWriterPerThread.java:391)
  [9] org.apache.lucene.index.DocumentsWriter.doFlush (DocumentsWriter.java:493)
  [10] org.apache.lucene.index.DocumentsWriter.flushAllThreads (DocumentsWriter.java:672)
  [11] org.apache.lucene.index.IndexWriter.doFlush (IndexWriter.java:4,014)
  [12] org.apache.lucene.index.IndexWriter.flush (IndexWriter.java:3,988)
  [13] org.apache.lucene.index.IndexWriter.shutdown (IndexWriter.java:1,321)
  [14] org.apache.lucene.index.IndexWriter.close (IndexWriter.java:1,361)
  [15] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:166)

Search

main[1] where
  [1] org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector.getLeafCollector (TopScoreDocCollector.java:57)
  [2] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:759)
  [3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
  [4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
  [5] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [7] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [8] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
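For reference, the whole search path above is driven by code of roughly this shape (index directory and field name are assumptions based on the demo):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SearchDemo {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query query = new QueryParser("contents", new StandardAnalyzer()).parse("am");
      TopDocs results = searcher.search(query, 10); // drives the stack above
      for (ScoreDoc sd : results.scoreDocs) {
        System.out.println("doc=" + sd.doc + " score=" + sd.score);
      }
    }
  }
}
```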

Fetching the term

Reading the term from the terms reader:

main[1] print fieldMap.get(field)
 fieldMap.get(field) = "BlockTreeTerms(seg=_j terms=18,postings=20,positions=25,docs=2)"
main[1] where
  [1] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsReader.terms (Lucene90BlockTreeTermsReader.java:294)
  [2] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms (PerFieldPostingsFormat.java:353)
  [3] org.apache.lucene.index.CodecReader.terms (CodecReader.java:114)
  [4] org.apache.lucene.index.Terms.getTerms (Terms.java:41)
  [5] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:115)
  [6] org.apache.lucene.index.TermStates.build (TermStates.java:102)
  [7] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
  [8] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
  [9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
  [10] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [11] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [12] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Getting the corresponding output via an FST arc

Breakpoint hit: "thread=main", org.apache.lucene.util.fst.FST.findTargetArc(), line=1,412 bci=0
1,412        if (labelToMatch == END_LABEL) {

main[1] where
  [1] org.apache.lucene.util.fst.FST.findTargetArc (FST.java:1,412)
  [2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact (SegmentTermsEnum.java:511)
  [3] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
  [4] org.apache.lucene.index.TermStates.build (TermStates.java:102)
  [5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
  [6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
  [7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
  [8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Opening the .tim file

main[1] where
  [1] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsReader.<init> (Lucene90BlockTreeTermsReader.java:135)
  [2] org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.fieldsProducer (Lucene90PostingsFormat.java:427)
  [3] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init> (PerFieldPostingsFormat.java:329)
  [4] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer (PerFieldPostingsFormat.java:391)
  [5] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:118)
  [6] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
  [7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
  [8] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
  [9] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
  [10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
  [11] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
  [12] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
  [13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

The core function for fetching the top-k results is mergeAux, a helper that merges the per-shard top hits:

Step completed: "thread=main", org.apache.lucene.search.TopDocs.mergeAux(), line=291 bci=43
291        for (int shardIDX = 0; shardIDX < shardHits.length; shardIDX++) {

main[1] where
  [1] org.apache.lucene.search.TopDocs.mergeAux (TopDocs.java:291)
  [2] org.apache.lucene.search.TopDocs.merge (TopDocs.java:216)
  [3] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:528)
  [4] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
  [5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
  [6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
  [7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Fetching the document content for a docid

Getting the document by docid:

main[1] where
  [1] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.seek (ByteBufferIndexInput.java:529)
  [2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState.document (Lucene90CompressingStoredFieldsReader.java:594)
  [3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document (Lucene90CompressingStoredFieldsReader.java:610)
  [4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:628)
  [5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
  [6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
  [7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
  [8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
  [9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
  [10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The seek method positions the input by offset before the document is read; inside seek, curBuf is a java.nio.DirectByteBufferR:

525    
526        @Override
527        public void seek(long pos) throws IOException {
528          try {
529 =>         curBuf.position((int) pos);
530          } catch (IllegalArgumentException e) {
531            if (pos < 0) {
532              throw new IllegalArgumentException("Seeking to negative position: " + this, e);
533            } else {
534              throw new EOFException("seek past EOF: " + this);
main[1] print curBuf
 curBuf = "java.nio.DirectByteBufferR[pos=60 lim=154 cap=154]"
main[1] list
168    
169      // NOTE: AIOOBE not EOF if you read too much
170      @Override
171      public void readBytes(byte[] b, int offset, int len) {
172 =>     System.arraycopy(bytes, pos, b, offset, len);
173        pos += len;
174      }
175    }
main[1] where
  [1] org.apache.lucene.store.ByteArrayDataInput.readBytes (ByteArrayDataInput.java:172)
  [2] org.apache.lucene.store.DataInput.readString (DataInput.java:265)
  [3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.readField (Lucene90CompressingStoredFieldsReader.java:246)
  [4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:640)
  [5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
  [6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
  [7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
  [8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
  [9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
  [10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Loading file data through off-heap memory

Breakpoint hit: "thread=main", org.apache.lucene.store.ByteBufferIndexInput.setCurBuf(), line=83 bci=0
83        this.curBuf = curBuf;

main[1] where
  [1] org.apache.lucene.store.ByteBufferIndexInput.setCurBuf (ByteBufferIndexInput.java:83)
  [2] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.<init> (ByteBufferIndexInput.java:520)
  [3] org.apache.lucene.store.ByteBufferIndexInput.newInstance (ByteBufferIndexInput.java:60)
  [4] org.apache.lucene.store.MMapDirectory.openInput (MMapDirectory.java:238)
  [5] org.apache.lucene.store.Directory.openChecksumInput (Directory.java:152)
  [6] org.apache.lucene.index.SegmentInfos.readCommit (SegmentInfos.java:297)
  [7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:88)
  [8] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
  [9] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
  [10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
  [11] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
  [12] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
  [13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

FileChannel's map

The corresponding Java class:

src\java.base\share\classes\sun\nio\ch\FileChannelImpl.java
    // Creates a new mapping
    private native long map0(int prot, long position, long length, boolean isSync)
        throws IOException;
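On the Java side, the public entry into this native mapping is FileChannel.map, which MMapDirectory relies on under the hood; a minimal standalone sketch:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MapDemo {
  public static void main(String[] args) throws Exception {
    try (FileChannel ch = FileChannel.open(Paths.get("_0.fdt"), StandardOpenOption.READ)) {
      // This call bottoms out in map0 above (mmap64 on Linux)
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      System.out.printf("first byte: 0x%02x%n", buf.get(0)); // 0x3f, the codec magic
    }
  }
}
```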

The corresponding native C implementation:

src\java.base\unix\native\libnio\ch\FileChannelImpl.c
JNIEXPORT jlong JNICALL
Java_sun_nio_ch_FileChannelImpl_map0(JNIEnv *env, jobject this,
                                     jint prot, jlong off, jlong len, jboolean map_sync)
{
    void *mapAddress = 0;
    jobject fdo = (*env)->GetObjectField(env, this, chan_fd);
    jint fd = fdval(env, fdo);
    int protections = 0;
    int flags = 0;

    // should never be called with map_sync and prot == PRIVATE
    assert((prot != sun_nio_ch_FileChannelImpl_MAP_PV) || !map_sync);

    if (prot == sun_nio_ch_FileChannelImpl_MAP_RO) {
        protections = PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_RW) {
        protections = PROT_WRITE | PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_PV) {
        protections =  PROT_WRITE | PROT_READ;
        flags = MAP_PRIVATE;
    }

    // if MAP_SYNC and MAP_SHARED_VALIDATE are not defined then it is
    // best to define them here. This ensures the code compiles on old
    // OS releases which do not provide the relevant headers. If run
    // on the same machine then it will work if the kernel contains
    // the necessary support otherwise mmap should fail with an
    // invalid argument error

#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif

    if (map_sync) {
        // ensure
        //  1) this is Linux on AArch64, x86_64, or PPC64 LE
        //  2) the mmap APIs are available at compile time
#if !defined(LINUX) || ! (defined(aarch64) || (defined(amd64) && defined(_LP64)) || defined(ppc64le))
        // TODO - implement for solaris/AIX/BSD/WINDOWS and for 32 bit
        JNU_ThrowInternalError(env, "should never call map on platform where MAP_SYNC is unimplemented");
        return IOS_THROWN;
#else
        flags |= MAP_SYNC | MAP_SHARED_VALIDATE;
#endif
    }

    mapAddress = mmap64(
        0,                    /* Let OS decide location */
        len,                  /* Number of bytes to map */
        protections,          /* File permissions */
        flags,                /* Changes are shared */
        fd,                   /* File descriptor of mapped file */
        off);                 /* Offset into file */

    if (mapAddress == MAP_FAILED) {
        if (map_sync && errno == ENOTSUP) {
            JNU_ThrowIOExceptionWithLastError(env, "map with mode MAP_SYNC unsupported");
            return IOS_THROWN;
        }

        if (errno == ENOMEM) {
            JNU_ThrowOutOfMemoryError(env, "Map failed");
            return IOS_THROWN;
        }
        return handle(env, -1, "Map failed");
    }

    return ((jlong) (unsigned long) mapAddress);
}

mmap maps the file so its on-disk contents can be read

The mapping ultimately goes through the native map0, which on Linux is mmap64; MMapDirectory.openInput opens a FileChannel and then maps it:

main[1] list
228    
229      /** Creates an IndexInput for the file with the given name. */
230      @Override
231      public IndexInput openInput(String name, IOContext context) throws IOException {
232 =>     ensureOpen();
233        ensureCanRead(name);
234        Path path = directory.resolve(name);
235        try (FileChannel c = FileChannel.open(path, StandardOpenOption.READ)) {
236          final String resourceDescription = "MMapIndexInput(path=\"" + path.toString() + "\")";
237          final boolean useUnmap = getUseUnmap();
main[1] print name
 name = "_j.fnm"
main[1] where
  [1] org.apache.lucene.store.MMapDirectory.openInput (MMapDirectory.java:232)
  [2] org.apache.lucene.store.Directory.openChecksumInput (Directory.java:152)
  [3] org.apache.lucene.codecs.lucene90.Lucene90FieldInfosFormat.read (Lucene90FieldInfosFormat.java:124)
  [4] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:111)
  [5] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
  [6] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
  [7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
  [8] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
  [9] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
  [10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
  [11] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
  [12] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

Reading the mmapped data

Where does the mmapped buffer actually get used? Much like ordinary file I/O: seek to a position, then read the bytes.

lucene\core\src\java\org\apache\lucene\store\ByteBufferIndexInput.java
  @Override
  public final void readBytes(byte[] b, int offset, int len) throws IOException {
    try {
      guard.getBytes(curBuf, b, offset, len);
    } catch (
        @SuppressWarnings("unused")
        BufferUnderflowException e) {
      int curAvail = curBuf.remaining();
      while (len > curAvail) {
        guard.getBytes(curBuf, b, offset, curAvail);
        len -= curAvail;
        offset += curAvail;
        curBufIndex++;
        if (curBufIndex >= buffers.length) {
          throw new EOFException("read past EOF: " + this);
        }
        setCurBuf(buffers[curBufIndex]);
        curBuf.position(0);
        curAvail = curBuf.remaining();
      }
      guard.getBytes(curBuf, b, offset, len);
    } catch (
        @SuppressWarnings("unused")
        NullPointerException npe) {
      throw new AlreadyClosedException("Already closed: " + this);
    }
  }

Reading the data after mmap:

main[1] where
  [1] jdk.internal.misc.Unsafe.copyMemory (Unsafe.java:782)
  [2] java.nio.DirectByteBuffer.get (DirectByteBuffer.java:308)
  [3] org.apache.lucene.store.ByteBufferGuard.getBytes (ByteBufferGuard.java:93)
  [4] org.apache.lucene.store.ByteBufferIndexInput.readBytes (ByteBufferIndexInput.java:114)
  [5] org.apache.lucene.store.BufferedChecksumIndexInput.readBytes (BufferedChecksumIndexInput.java:46)
  [6] org.apache.lucene.store.DataInput.readString (DataInput.java:265)
  [7] org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic (CodecUtil.java:202)
  [8] org.apache.lucene.codecs.CodecUtil.checkHeader (CodecUtil.java:193)
  [9] org.apache.lucene.codecs.CodecUtil.checkIndexHeader (CodecUtil.java:253)
  [10] org.apache.lucene.codecs.lucene90.Lucene90FieldInfosFormat.read (Lucene90FieldInfosFormat.java:128)
  [11] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:111)
  [12] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
  [13] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
  [14] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
  [15] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
  [16] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
  [17] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
  [18] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
  [19] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

File format overview

The .fnm file

Format reference

The .fnm file is made up of the parts listed below. It describes each field's basic information; you can think of it as field metadata.



Field names are stored in the field info file, with suffix .fnm.

FieldInfos (.fnm) --> Header, FieldsCount, <FieldName, FieldNumber, FieldBits, DocValuesBits, DocValuesGen, Attributes, DimensionCount, DimensionNumBytes> (repeated FieldsCount times), Footer

Data types:

- Header --> IndexHeader
- FieldsCount --> VInt
- FieldName --> String
- FieldBits, IndexOptions, DocValuesBits --> Byte
- FieldNumber, DimensionCount, DimensionNumBytes --> VInt
- Attributes --> Map<String,String>
- DocValuesGen --> Int64
- Footer --> CodecFooter

Field descriptions:

- FieldsCount: the number of fields in this file.
- FieldName: name of the field as a UTF-8 string.
- FieldNumber: the field's number. Note that unlike previous versions of Lucene, the fields are not numbered implicitly by their order in the file; the numbering is explicit.
- FieldBits: a byte containing field options.
  - The low-order bit (0x1) is one for fields that have term vectors stored, and zero for fields without term vectors.
  - If the second lowest-order bit is set (0x2), norms are omitted for the indexed field.
  - If the third lowest-order bit is set (0x4), payloads are stored for the indexed field.
- IndexOptions: a byte containing index options.
  - 0: not indexed
  - 1: indexed as DOCS_ONLY
  - 2: indexed as DOCS_AND_FREQS
  - 3: indexed as DOCS_AND_FREQS_AND_POSITIONS
  - 4: indexed as DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
- DocValuesBits: a byte containing per-document value types. The type is recorded as two four-bit integers, with the high-order bits representing norms options and the low-order bits representing DocValues options. Each four-bit integer can be decoded as follows:
  - 0: no DocValues for this field
  - 1: NumericDocValues (DocValuesType.NUMERIC)
  - 2: BinaryDocValues (DocValuesType.BINARY)
  - 3: SortedDocValues (DocValuesType.SORTED)
- DocValuesGen: the generation count of the field's DocValues. If this is -1, there are no DocValues updates to that field. Anything above zero means there are updates stored by DocValuesFormat.
- Attributes: a key-value map of codec-private attributes.
- PointDimensionCount, PointNumBytes: non-zero only if the field is indexed as points, e.g. using LongPoint.
- VectorDimension: non-zero if the field is indexed as vectors.
- VectorSimilarityFunction: a byte containing the distance function used for similarity calculation.
  - 0: EUCLIDEAN distance (VectorSimilarityFunction.EUCLIDEAN)
  - 1: DOT_PRODUCT similarity (VectorSimilarityFunction.DOT_PRODUCT)
  - 2: COSINE similarity (VectorSimilarityFunction.COSINE)
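Decoding FieldBits and DocValuesBits per the tables above; a small sketch (variable names and example values are mine):

```java
public class FieldBitsDemo {
  public static void main(String[] args) {
    int fieldBits = 0x02, docValuesBits = 0x10; // example values
    boolean storeTermVectors = (fieldBits & 0x01) != 0;
    boolean omitNorms        = (fieldBits & 0x02) != 0;
    boolean storePayloads    = (fieldBits & 0x04) != 0;
    int normsType     = (docValuesBits >>> 4) & 0x0F; // high nibble: norms options
    int docValuesType = docValuesBits & 0x0F;         // low nibble: 0=none,1=NUMERIC,2=BINARY,3=SORTED
    System.out.println("tv=" + storeTermVectors + " omitNorms=" + omitNorms
        + " payloads=" + storePayloads + " norms=" + normsType + " dv=" + docValuesType);
  }
}
```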

The .fdt file

File path: lucene\backward-codecs\src\java\org\apache\lucene\backward_codecs\lucene50\Lucene50CompoundFormat.java

I couldn't find the .fdt format documented for the 9.0 codec, only the 2.9.4 one, so that older description will have to do.

main[1] print fieldsStreamFN
 fieldsStreamFN = "_j.fdt"
main[1] list
124        numDocs = si.maxDoc();
125    
126        final String fieldsStreamFN =
127            IndexFileNames.segmentFileName(segment, segmentSuffix, FIELDS_EXTENSION);
128 =>     ChecksumIndexInput metaIn = null;
129        try {
130          // Open the data file
131          fieldsStream = d.openInput(fieldsStreamFN, context);
132          version =
133              CodecUtil.checkIndexHeader(
main[1] where
  [1] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.<init> (Lucene90CompressingStoredFieldsReader.java:128)
  [2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat.fieldsReader (Lucene90CompressingStoredFieldsFormat.java:133)
  [3] org.apache.lucene.codecs.lucene90.Lucene90StoredFieldsFormat.fieldsReader (Lucene90StoredFieldsFormat.java:136)
  [4] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:138)
  [5] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
  [6] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
  [7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
  [8] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
  [9] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
  [10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
  [11] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
  [12] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

Loading the doc's content into a Document object

The overall flow fetches the document's content by docid:


  @Override
  public void visitDocument(int docID, StoredFieldVisitor visitor) throws IOException {

    final SerializedDocument doc = document(docID);  // fetch the serialized doc for this docID

    for (int fieldIDX = 0; fieldIDX < doc.numStoredFields; fieldIDX++) {
      final long infoAndBits = doc.in.readVLong();
      final int fieldNumber = (int) (infoAndBits >>> TYPE_BITS);
      final FieldInfo fieldInfo = fieldInfos.fieldInfo(fieldNumber);

      final int bits = (int) (infoAndBits & TYPE_MASK);
      assert bits <= NUMERIC_DOUBLE : "bits=" + Integer.toHexString(bits);

      switch (visitor.needsField(fieldInfo)) {
        case YES:
          readField(doc.in, visitor, fieldInfo, bits);   // read through the input, i.e. the fd bound to the mmap64-mapped file; this is where the .fdt file gets read
          break;
        ...
      }
    }
  }
main[1] where
  [1] org.apache.lucene.document.Document.add (Document.java:60)
  [2] org.apache.lucene.document.DocumentStoredFieldVisitor.stringField (DocumentStoredFieldVisitor.java:74)
  [3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.readField (Lucene90CompressingStoredFieldsReader.java:246)
  [4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:640)
  [5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
  [6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
  [7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
  [8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
  [9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
  [10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Building a SerializedDocument from the docid

The entry point is here:

org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document

Lucene90CompressingStoredFieldsReader's document method:

main[1] where
  [1] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.seek (ByteBufferIndexInput.java:529)
  [2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document (Lucene90CompressingStoredFieldsReader.java:606)
  [3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:628)
  [4] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
  [5] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
  [6] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
  [7] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
  [8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
  [9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
  SerializedDocument document(int docID) throws IOException {
    if (state.contains(docID) == false) {
      fieldsStream.seek(indexReader.getStartPointer(docID)); // seek within the mmap64-mapped file
      state.reset(docID);
    }
    assert state.contains(docID);
    return state.document(docID);    // see the concrete implementation below; state is an instance of a static inner class
  }

Now the static inner class's implementation:

    /**
     * Get the serialized representation of the given docID. This docID has to be contained in the
     * current block.
     */
    SerializedDocument document(int docID) throws IOException {
      if (contains(docID) == false) {
        throw new IllegalArgumentException();
      }

      final int index = docID - docBase;   
      final int offset = Math.toIntExact(offsets[index]);
      final int length = Math.toIntExact(offsets[index + 1]) - offset;
      final int totalLength = Math.toIntExact(offsets[chunkDocs]);
      final int numStoredFields = Math.toIntExact(this.numStoredFields[index]);

      final BytesRef bytes;
      if (merging) {
        bytes = this.bytes;
      } else {
        bytes = new BytesRef();
      }

      final DataInput documentInput;
      if (length == 0) {
        ...
      } else {
        fieldsStream.seek(startPointer);  // seek to the mmap'ed offset
        decompressor.decompress(fieldsStream, totalLength, offset, length, bytes); // decompress the corresponding data
        assert bytes.length == length;
        documentInput = new ByteArrayDataInput(bytes.bytes, bytes.offset, bytes.length);  // wrap the decompressed bytes for reading
      }

      return new SerializedDocument(documentInput, length, numStoredFields);  // build the SerializedDocument
    }
  }

Now the loading process in detail:

 pos = 4
main[1] dump bytes
 bytes = {
120, 116, 0, 26, 47, 104, 111, 109, 101, 47, 117, 98, 117, 110, 116, 117, 47, 100, 111, 99, 47, 104, 101, 108, 108, 111, 46, 116, 120, 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
}
main[1] print in
 in = "MMapIndexInput(path="/home/ubuntu/index/_j.fdt")"
main[1] where
  [1] org.apache.lucene.codecs.lucene90.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor.decompress (LZ4WithPresetDictCompressionMode.java:88)
  [2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState.document (Lucene90CompressingStoredFieldsReader.java:595)
  [3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document (Lucene90CompressingStoredFieldsReader.java:610)
  [4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:628)
  [5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
  [6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
  [7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
  [8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
  [9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
  [10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
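The dumped bytes are the decompressed chunk: the byte 26 (0x1a) just before the path run is its VInt length, and the 26 bytes after it are the UTF-8 of the stored path. A quick check (bytes copied from the dump above):

```java
public class DecodeDemo {
  public static void main(String[] args) {
    // The 26 bytes following the length byte 26 (0x1a) in the dump above:
    byte[] path = {47, 104, 111, 109, 101, 47, 117, 98, 117, 110, 116, 117,
                   47, 100, 111, 99, 47, 104, 101, 108, 108, 111, 46, 116, 120, 116};
    System.out.println(new String(path, java.nio.charset.StandardCharsets.UTF_8));
    // -> /home/ubuntu/doc/hello.txt
  }
}
```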

Loading and processing the term dictionary

 public SegmentTermsEnum(FieldReader fr) throws IOException {
    this.fr = fr;

    // if (DEBUG) {
    //   System.out.println("BTTR.init seg=" + fr.parent.segment);
    // }
    stack = new SegmentTermsEnumFrame[0];

    // Used to hold seek by TermState, or cached seek
    staticFrame = new SegmentTermsEnumFrame(this, -1);

    if (fr.index == null) {
      fstReader = null;
    } else {
      fstReader = fr.index.getBytesReader();
    }

    // Init w/ root block; don't use index since it may
    // not (and need not) have been loaded
    for (int arcIdx = 0; arcIdx < arcs.length; arcIdx++) {
      arcs[arcIdx] = new FST.Arc<>();
    }

    currentFrame = staticFrame;
    final FST.Arc<BytesRef> arc;
    if (fr.index != null) {
      arc = fr.index.getFirstArc(arcs[0]);
      // Empty string prefix must have an output in the index!
      assert arc.isFinal();
    } else {
      arc = null;
    }
    // currentFrame = pushFrame(arc, rootCode, 0);
    // currentFrame.loadBlock();
    validIndexPrefix = 0;
    // if (DEBUG) {
    //   System.out.println("init frame state " + currentFrame.ord);
    //   printSeekState();
    // }

    // System.out.println();
    // computeBlockStats().print(System.out);
  }

Resolving an arc via getArc:

  private FST.Arc<BytesRef> getArc(int ord) {
    if (ord >= arcs.length) {
      @SuppressWarnings({"rawtypes", "unchecked"})
      final FST.Arc<BytesRef>[] next =
          new FST.Arc[ArrayUtil.oversize(1 + ord, RamUsageEstimator.NUM_BYTES_OBJECT_REF)];
      System.arraycopy(arcs, 0, next, 0, arcs.length);
      for (int arcOrd = arcs.length; arcOrd < next.length; arcOrd++) {
        next[arcOrd] = new FST.Arc<>();
      }
      arcs = next;
    }
    return arcs[ord];
  }
Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.getArc(), line=222 bci=0
222        if (ord >= arcs.length) {

main[1] where
  [1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.getArc (SegmentTermsEnum.java:222)
  [2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact (SegmentTermsEnum.java:511)
  [3] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
  [4] org.apache.lucene.index.TermStates.build (TermStates.java:102)
  [5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
  [6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
  [7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
  [8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Collecting all matching documents

main[1] where
  [1] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:300)
  [2] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
  [3] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
  [4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
  [5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
  [6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
  [7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1] list
296            DocIdSetIterator iterator,
297            TwoPhaseIterator twoPhase,
298            Bits acceptDocs)
299            throws IOException {
300 =>       if (twoPhase == null) {
301            for (int doc = iterator.nextDoc();
302                doc != DocIdSetIterator.NO_MORE_DOCS;
303                doc = iterator.nextDoc()) {
304              if (acceptDocs == null || acceptDocs.get(doc)) {
305                collector.collect(doc);
main[1] print iterator
 iterator = "org.apache.lucene.search.ImpactsDISI@6279cee3"
main[1] list
494        @Override
495        public int advance(int target) throws IOException {
496          // current skip docID < docIDs generated from current buffer <= next skip docID
497          // we don't need to skip if target is buffered already
498 =>       if (docFreq > BLOCK_SIZE && target > nextSkipDoc) {
499    
500            if (skipper == null) {
501              // Lazy init: first time this enum has ever been used for skipping
502              skipper =
503                  new Lucene90SkipReader(
main[1] where
  [1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum.advance (Lucene90PostingsReader.java:498)
  [2] org.apache.lucene.index.SlowImpactsEnum.advance (SlowImpactsEnum.java:77)
  [3] org.apache.lucene.search.ImpactsDISI.advance (ImpactsDISI.java:135)
  [4] org.apache.lucene.search.ImpactsDISI.nextDoc (ImpactsDISI.java:140)
  [5] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:301)
  [6] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
  [7] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
  [8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
  [9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
  [10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
  [11] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [12] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [13] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [14] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The class behind the iterator is SegmentTermsEnum:

main[1] where
  [1] org.apache.lucene.search.TermQuery$TermWeight.getTermsEnum (TermQuery.java:145)
  [2] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:107)
  [3] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
  [4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
  [5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
  [6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
  [7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1] print termsEnum
 termsEnum = "org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum@1a84f40f"

The getTermsEnum method yields the term's stats and file offsets; SegmentTermsEnum itself does not carry the doc iterator.

main[1] where
  [1] org.apache.lucene.index.Term.bytes (Term.java:128)
  [2] org.apache.lucene.search.TermQuery$TermWeight.getTermsEnum (TermQuery.java:145)
  [3] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:107)
  [4] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
  [5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
  [6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
  [7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
  [8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


144          final TermsEnum termsEnum = context.reader().terms(term.field()).iterator();
145 =>       termsEnum.seekExact(term.bytes(), state);
146          return termsEnum;
147        }

Here term.bytes() is our search term's bytes, so the term's postings info is read starting from this point (not fully verified yet, but let's go with that for now).

After the postings are read, scoring and ranking begin; the scorer exposes an iterator that can walk all the doc IDs.

main[1] list
348        // (needsFreq=false)
349        private boolean isFreqsRead;
350        private int singletonDocID; // docid when there is a single pulsed posting, otherwise -1
351    
352 =>     public BlockDocsEnum(FieldInfo fieldInfo) throws IOException {
353          this.startDocIn = Lucene90PostingsReader.this.docIn;
354          this.docIn = null;
355          indexHasFreq = fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS) >= 0;
356          indexHasPos =
357              fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
main[1] where
  [1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum.<init> (Lucene90PostingsReader.java:352)
  [2] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.postings (Lucene90PostingsReader.java:258)
  [3] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.impacts (Lucene90PostingsReader.java:280)
  [4] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.impacts (SegmentTermsEnum.java:1,150)
  [5] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:114)
  [6] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
  [7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
  [8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
  [9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
  [10] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [11] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [12] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The top-k collector's stack:

Breakpoint hit: "thread=main", org.apache.lucene.search.TopDocsCollector.populateResults(), line=64 bci=0
64        for (int i = howMany - 1; i >= 0; i--) {

main[1] where
  [1] org.apache.lucene.search.TopDocsCollector.populateResults (TopDocsCollector.java:64)
  [2] org.apache.lucene.search.TopDocsCollector.topDocs (TopDocsCollector.java:166)
  [3] org.apache.lucene.search.TopDocsCollector.topDocs (TopDocsCollector.java:98)
  [4] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:526)
  [5] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
  [6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
  [7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
  [8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1]

The search flow

main[1] dump collector
 collector = {
    org.apache.lucene.search.TopScoreDocCollector.docBase: 0
    org.apache.lucene.search.TopScoreDocCollector.pqTop: instance of org.apache.lucene.search.ScoreDoc(id=1529)
    org.apache.lucene.search.TopScoreDocCollector.hitsThresholdChecker: instance of org.apache.lucene.search.HitsThresholdChecker$LocalHitsThresholdChecker(id=1530)
    org.apache.lucene.search.TopScoreDocCollector.minScoreAcc: null
    org.apache.lucene.search.TopScoreDocCollector.minCompetitiveScore: 0.0
    org.apache.lucene.search.TopScoreDocCollector.$assertionsDisabled: true
    org.apache.lucene.search.TopDocsCollector.EMPTY_TOPDOCS: instance of org.apache.lucene.search.TopDocs(id=1531)
    org.apache.lucene.search.TopDocsCollector.pq: instance of org.apache.lucene.search.HitQueue(id=1532)
    org.apache.lucene.search.TopDocsCollector.totalHits: 0
    org.apache.lucene.search.TopDocsCollector.totalHitsRelation: instance of org.apache.lucene.search.TotalHits$Relation(id=1533)
}
main[1] print collector
 collector = "org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector@62bd765"

How the hit count is obtained

690      private <C extends Collector, T> T search(
691          Weight weight, CollectorManager<C, T> collectorManager, C firstCollector) throws IOException {
692        if (executor == null || leafSlices.length <= 1) {
693          search(leafContexts, weight, firstCollector);
694 =>       return collectorManager.reduce(Collections.singletonList(firstCollector));
695        } else {
696          final List<C> collectors = new ArrayList<>(leafSlices.length);
697          collectors.add(firstCollector);
698          final ScoreMode scoreMode = firstCollector.scoreMode();
699          for (int i = 1; i < leafSlices.length; ++i) {
main[1] where
  [1] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
  [2] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
  [3] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [5] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [6] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Starting from org.apache.lucene.search.TopScoreDocCollector.create and paging upward, we find the collector already exists by org.apache.lucene.search.IndexSearcher.searchAfter. So where does the hit count get initialized?

Clearly, search fills in firstCollector's data; so where is the assignment made?

 protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
      throws IOException {

    // TODO: should we make this
    // threaded...? the Collector could be sync'd?
    // always use single thread:
    for (LeafReaderContext ctx : leaves) { // search each subreader
      final LeafCollector leafCollector;
      try {
        leafCollector = collector.getLeafCollector(ctx);
      } catch (
          @SuppressWarnings("unused")
          CollectionTerminatedException e) {
        // there is no doc of interest in this reader context
        // continue with the following leaf
        continue;
      }
      BulkScorer scorer = weight.bulkScorer(ctx);      // this is where the total hits get gathered
      if (scorer != null) {
        try {
          scorer.score(leafCollector, ctx.reader().getLiveDocs());
        } catch (
            @SuppressWarnings("unused")
            CollectionTerminatedException e) {
          // collection was terminated prematurely
          // continue with the following leaf
        }
      }
    }
  }

The final stack confirms that totalHits is assigned here: each call increments it by one, so it is plainly a running count. That count tallies the matched search results, so where do the matches themselves come from?

We can only trace further up the stack.

main[1] where
  [1] org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect (TopScoreDocCollector.java:73)
  [2] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:305)
  [3] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
  [4] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
  [5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
  [6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
  [7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
  [8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

        @Override
        public void collect(int doc) throws IOException {
          float score = scorer.score();   // compute the score; a pluggable callback-style score() function

          // This collector relies on the fact that scorers produce positive values:
          assert score >= 0; // NOTE: false for NaN

          totalHits++;                 // the hit count is incremented here
          hitsThresholdChecker.incrementHitCount();

          if (minScoreAcc != null && (totalHits & minScoreAcc.modInterval) == 0) {
            updateGlobalMinCompetitiveScore(scorer);
          }

          if (score <= pqTop.score) {
            if (totalHitsRelation == TotalHits.Relation.EQUAL_TO) {
              // we just reached totalHitsThreshold, we can start setting the min
              // competitive score now
              updateMinCompetitiveScore(scorer);
            }
            // Since docs are returned in-order (i.e., increasing doc Id), a document
            // with equal score to pqTop.score cannot compete since HitQueue favors
            // documents with lower doc Ids. Therefore reject those docs too.
            return;
          }
          pqTop.doc = doc + docBase;
          pqTop.score = score;
          pqTop = pq.updateTop();
          updateMinCompetitiveScore(scorer);
        }
      };

Pushing further up, we find the stack: the scorer is created from the context.

  /**
   * Optional method, to return a {@link BulkScorer} to score the query and send hits to a {@link
   * Collector}. Only queries that have a different top-level approach need to override this; the
   * default implementation pulls a normal {@link Scorer} and iterates and collects the resulting
   * hits which are not marked as deleted.
   *
   * @param context the {@link org.apache.lucene.index.LeafReaderContext} for which to return the
   *     {@link Scorer}.
   * @return a {@link BulkScorer} which scores documents and passes them to a collector.
   * @throws IOException if there is a low-level I/O error
   */
  public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {

    Scorer scorer = scorer(context);
    if (scorer == null) {
      // No docs match
      return null;
    }

    // This impl always scores docs in order, so we can
    // ignore scoreDocsInOrder:
    return new DefaultBulkScorer(scorer);
  }

Looking one level higher: bulkScorer calls back into a scorer method, and that abstract scorer method is implemented in org.apache.lucene.search.TermQuery$TermWeight.scorer.

This scorer method uses the context parameter plus the enclosing TermQuery's term to produce the matching hits.

main[1] list
103          assert termStates == null || termStates.wasBuiltFor(ReaderUtil.getTopLevelContext(context))
104              : "The top-reader used to create Weight is not the same as the current reader's top-reader ("
105                  + ReaderUtil.getTopLevelContext(context);
106          ;
107 =>       final TermsEnum termsEnum = getTermsEnum(context);
108          if (termsEnum == null) {
109            return null;
110          }
111          LeafSimScorer scorer =
112              new LeafSimScorer(simScorer, context.reader(), term.field(), scoreMode.needsScores());    // term here is the outer class's term, i.e. this$0.term
main[1] where
  [1] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:107)
  [2] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
  [3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
  [4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
  [5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
  [6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

So advance will be called next, ending in the advance method below, which uses docTermStartFP. Where is that variable initialized?

It actually comes from the termStates; the initialization is docTermStartFP = termState.docStartFP;



lucene\core\src\java\org\apache\lucene\codecs\lucene90\Lucene90PostingsReader.java
    @Override
    public int advance(int target) throws IOException {
      // current skip docID < docIDs generated from current buffer <= next skip docID
      // we don't need to skip if target is buffered already
      if (docFreq > BLOCK_SIZE && target > nextSkipDoc) {

        if (skipper == null) {
          // Lazy init: first time this enum has ever been used for skipping
          skipper =
              new Lucene90SkipReader(
                  docIn.clone(), MAX_SKIP_LEVELS, indexHasPos, indexHasOffsets, indexHasPayloads);
        }

        if (!skipped) {
          assert skipOffset != -1;
          // This is the first time this enum has skipped
          // since reset() was called; load the skip data:
          skipper.init(docTermStartFP + skipOffset, docTermStartFP, 0, 0, docFreq);
          skipped = true;
        }

        // always plus one to fix the result, since skip position in Lucene90SkipReader
        // is a little different from MultiLevelSkipListReader
        final int newDocUpto = skipper.skipTo(target) + 1;

        if (newDocUpto >= blockUpto) {
          // Skipper moved
          assert newDocUpto % BLOCK_SIZE == 0 : "got " + newDocUpto;
          blockUpto = newDocUpto;

          // Force to read next block
          docBufferUpto = BLOCK_SIZE;
          accum = skipper.getDoc(); // actually, this is just lastSkipEntry
          docIn.seek(skipper.getDocPointer()); // now point to the block we want to search
          // even if freqBuffer were not read from the previous block, we will mark them as read,
          // as we don't need to skip the previous block freqBuffer in refillDocs,
          // as we have already positioned docIn where in needs to be.
          isFreqsRead = true;
        }
        // next time we call advance, this is used to
        // foresee whether skipper is necessary.
        nextSkipDoc = skipper.getNextSkipDoc();
      }
      if (docBufferUpto == BLOCK_SIZE) {
        refillDocs();
      }

      // Now scan... this is an inlined/pared down version
      // of nextDoc():
      long doc;
      while (true) {
        doc = docBuffer[docBufferUpto];

        if (doc >= target) {
          break;
        }
        ++docBufferUpto;
      }

      docBufferUpto++;
      return this.doc = (int) doc;
    }

    @Override
    public long cost() {
      return docFreq;
    }
  }
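For reference, callers drive this enum through the DocIdSetIterator contract (scoreAll in the earlier stack is essentially this loop); a minimal sketch:

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

public class DrainDemo {
  // Walks every matching doc in order; nextDoc/advance on BlockDocsEnum
  // above are what ultimately feed this loop.
  static void drain(DocIdSetIterator it) throws IOException {
    for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      System.out.println("doc=" + doc);
    }
  }
}
```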

So how is termStates initialized? My first guess is that term is a member variable of termStates.

With a breakpoint, we finally land on the following:

main[1] list
178      }
179    
180      @Override
181      public BlockTermState newTermState() {
182 =>     return new IntBlockTermState();
183      }
184    
185      @Override
186      public void close() throws IOException {
187        IOUtils.close(docIn, posIn, payIn);
main[1] where
  [1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.newTermState (Lucene90PostingsReader.java:182)
  [2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.<init> (SegmentTermsEnumFrame.java:101)
  [3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.<init> (SegmentTermsEnum.java:76)
  [4] org.apache.lucene.codecs.lucene90.blocktree.FieldReader.iterator (FieldReader.java:153)
  [5] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:116)
  [6] org.apache.lucene.index.TermStates.build (TermStates.java:102)
  [7] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
  [8] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
  [9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
  [10] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [11] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [12] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1]

This should finally be the real core of the term lookup flow, I hope so:

main[1] list
113    
114      private static TermsEnum loadTermsEnum(LeafReaderContext ctx, Term term) throws IOException {
115        final Terms terms = Terms.getTerms(ctx.reader(), term.field());
116        final TermsEnum termsEnum = terms.iterator();
117 =>     if (termsEnum.seekExact(term.bytes())) {
118          return termsEnum;
119        }
120        return null;
121      }
122    
main[1] print term.bytes()
 term.bytes() = "[61 6d]"
main[1] where
  [1] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
  [2] org.apache.lucene.index.TermStates.build (TermStates.java:102)
  [3] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
  [4] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
  [5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
  [6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
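
"[61 6d]" is simply the UTF-8 encoding of "am". A quick sanity check (the field name "contents" is my guess from the demo):

import org.apache.lucene.index.Term;

public class TermBytes {
  public static void main(String[] args) {
    Term t = new Term("contents", "am");
    // a Term's text is stored as UTF-8 bytes: 0x61 0x6d == 'a' 'm'
    System.out.println(t.bytes()); // prints [61 6d]
  }
}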

At the very end it should call into here to read the counts for all the terms; exactly where still needs verifying, but the path should be this:

  // Target's prefix matches this block's prefix; we
  // scan the entries to check if the suffix matches.
  public SeekStatus scanToTermLeaf(BytesRef target, boolean exactOnly) throws IOException {

    // if (DEBUG) System.out.println("    scanToTermLeaf: block fp=" + fp + " prefix=" + prefix + "
    // nextEnt=" + nextEnt + " (of " + entCount + ") target=" + brToString(target) + " term=" +
    // brToString(term));

    assert nextEnt != -1;

    ste.termExists = true;
    subCode = 0;

    if (nextEnt == entCount) {
      if (exactOnly) {
        fillTerm();
      }
      return SeekStatus.END;
    }

    assert prefixMatches(target);

    // TODO: binary search when all terms have the same length, which is common for ID fields,
    // which are also the most sensitive to lookup performance?
    // Loop over each entry (term or sub-block) in this block:
    do {
      nextEnt++;

      suffix = suffixLengthsReader.readVInt();

      // if (DEBUG) {
      //   BytesRef suffixBytesRef = new BytesRef();
      //   suffixBytesRef.bytes = suffixBytes;
      //   suffixBytesRef.offset = suffixesReader.getPosition();
      //   suffixBytesRef.length = suffix;
      //   System.out.println("      cycle: term " + (nextEnt-1) + " (of " + entCount + ") suffix="
      // + brToString(suffixBytesRef));
      // }

      startBytePos = suffixesReader.getPosition();
      suffixesReader.skipBytes(suffix);

      // Loop over bytes in the suffix, comparing to the target
      final int cmp =
          Arrays.compareUnsigned(
              suffixBytes,
              startBytePos,
              startBytePos + suffix,
              target.bytes,
              target.offset + prefix,
              target.offset + target.length);

      if (cmp < 0) {
        // Current entry is still before the target;
        // keep scanning
      } else if (cmp > 0) {
        // Done!  Current entry is after target --
        // return NOT_FOUND:
        fillTerm();

        // if (DEBUG) System.out.println("        not found");
        return SeekStatus.NOT_FOUND;
      } else {
        // Exact match!

        // This cannot be a sub-block because we
        // would have followed the index to this
        // sub-block from the start:

        assert ste.termExists;
        fillTerm();
        // if (DEBUG) System.out.println("        found!");
        return SeekStatus.FOUND;
      }
    } while (nextEnt < entCount);

    // It is possible (and OK) that terms index pointed us
    // at this block, but, we scanned the entire block and
    // did not find the term to position to.  This happens
    // when the target is after the last term in the block
    // (but, before the next term in the index).  EG
    // target could be foozzz, and terms index pointed us
    // to the foo* block, but the last term in this block
    // was fooz (and, eg, first term in the next block will
    // be fop).
    // if (DEBUG) System.out.println("      block end");
    if (exactOnly) {
      fillTerm();
    }

    // TODO: not consistent that in the
    // not-exact case we don't next() into the next
    // frame here
    return SeekStatus.END;
  }

  // Target's prefix matches this block's prefix; we
  // scan the entries to check if the suffix matches.
  public SeekStatus scanToTermNonLeaf(BytesRef target, boolean exactOnly) throws IOException {

    // if (DEBUG) System.out.println("    scanToTermNonLeaf: block fp=" + fp + " prefix=" + prefix +
    // " nextEnt=" + nextEnt + " (of " + entCount + ") target=" + brToString(target) + " term=" +
    // brToString(target));

    assert nextEnt != -1;

    if (nextEnt == entCount) {
      if (exactOnly) {
        fillTerm();
        ste.termExists = subCode == 0;
      }
      return SeekStatus.END;
    }

    assert prefixMatches(target);

    // Loop over each entry (term or sub-block) in this block:
    while (nextEnt < entCount) {

      nextEnt++;

      final int code = suffixLengthsReader.readVInt();
      suffix = code >>> 1;

      // if (DEBUG) {
      //  BytesRef suffixBytesRef = new BytesRef();
      //  suffixBytesRef.bytes = suffixBytes;
      //  suffixBytesRef.offset = suffixesReader.getPosition();
      //  suffixBytesRef.length = suffix;
      //  System.out.println("      cycle: " + ((code&1)==1 ? "sub-block" : "term") + " " +
      // (nextEnt-1) + " (of " + entCount + ") suffix=" + brToString(suffixBytesRef));
      // }

      final int termLen = prefix + suffix;
      startBytePos = suffixesReader.getPosition();
      suffixesReader.skipBytes(suffix);
      ste.termExists = (code & 1) == 0;
      if (ste.termExists) {
        state.termBlockOrd++;
        subCode = 0;
      } else {
        subCode = suffixLengthsReader.readVLong();
        lastSubFP = fp - subCode;
      }

      final int cmp =
          Arrays.compareUnsigned(
              suffixBytes,
              startBytePos,
              startBytePos + suffix,
              target.bytes,
              target.offset + prefix,
              target.offset + target.length);

      if (cmp < 0) {
        // Current entry is still before the target;
        // keep scanning
      } else if (cmp > 0) {
        // Done!  Current entry is after target --
        // return NOT_FOUND:
        fillTerm();

        // if (DEBUG) System.out.println("        maybe done exactOnly=" + exactOnly + "
        // ste.termExists=" + ste.termExists);

        if (!exactOnly && !ste.termExists) {
          // System.out.println("  now pushFrame");
          // TODO this
          // We are on a sub-block, and caller wants
          // us to position to the next term after
          // the target, so we must recurse into the
          // sub-frame(s):
          ste.currentFrame = ste.pushFrame(null, ste.currentFrame.lastSubFP, termLen);
          ste.currentFrame.loadBlock();
          while (ste.currentFrame.next()) {
            ste.currentFrame = ste.pushFrame(null, ste.currentFrame.lastSubFP, ste.term.length());
            ste.currentFrame.loadBlock();         // <--------------------- this is where the stream gets loaded
          }
        }

        // if (DEBUG) System.out.println("        not found");
        return SeekStatus.NOT_FOUND;
      } else {
        // Exact match!

        // This cannot be a sub-block because we
        // would have followed the index to this
        // sub-block from the start:

        assert ste.termExists;
        fillTerm();
        // if (DEBUG) System.out.println("        found!");
        return SeekStatus.FOUND;
      }
    }

    // It is possible (and OK) that terms index pointed us
    // at this block, but, we scanned the entire block and
    // did not find the term to position to.  This happens
    // when the target is after the last term in the block
    // (but, before the next term in the index).  EG
    // target could be foozzz, and terms index pointed us
    // to the foo* block, but the last term in this block
    // was fooz (and, eg, first term in the next block will
    // be fop).
    // if (DEBUG) System.out.println("      block end");
    if (exactOnly) {
      fillTerm();
    }

    // TODO: not consistent that in the
    // not-exact case we don't next() into the next
    // frame here
    return SeekStatus.END;
  }

How does termState get deserialized?

Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.decodeTerm(), line=194 bci=0
194        final IntBlockTermState termState = (IntBlockTermState) _termState;

main[1] where
  [1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.decodeTerm (Lucene90PostingsReader.java:194)
  [2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.decodeMetaData (SegmentTermsEnumFrame.java:476)
  [3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.termState (SegmentTermsEnum.java:1,178)
  [4] org.apache.lucene.index.TermStates.build (TermStates.java:104)
  [5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
  [6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
  [7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
  [8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


  @Override
  public void decodeTerm(
      DataInput in, FieldInfo fieldInfo, BlockTermState _termState, boolean absolute)
      throws IOException {
    final IntBlockTermState termState = (IntBlockTermState) _termState;
    final boolean fieldHasPositions =
        fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
    final boolean fieldHasOffsets =
        fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)
            >= 0;
    final boolean fieldHasPayloads = fieldInfo.hasPayloads();

    if (absolute) {
      termState.docStartFP = 0;
      termState.posStartFP = 0;
      termState.payStartFP = 0;
    }

    final long l = in.readVLong();
    if ((l & 0x01) == 0) {
      termState.docStartFP += l >>> 1;
      if (termState.docFreq == 1) {
        termState.singletonDocID = in.readVInt();
      } else {
        termState.singletonDocID = -1;
      }
    } else {
      assert absolute == false;
      assert termState.singletonDocID != -1;
      termState.singletonDocID += BitUtil.zigZagDecode(l >>> 1);
    }

    if (fieldHasPositions) {
      termState.posStartFP += in.readVLong();
      if (fieldHasOffsets || fieldHasPayloads) {
        termState.payStartFP += in.readVLong();
      }
      if (termState.totalTermFreq > BLOCK_SIZE) {
        termState.lastPosBlockOffset = in.readVLong();
      } else {
        termState.lastPosBlockOffset = -1;
      }
    }

    if (termState.docFreq > BLOCK_SIZE) {
      termState.skipOffset = in.readVLong();
    } else {
      termState.skipOffset = -1;
    }
  }
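
Two details worth unpacking: the low bit of the first VLong is a flag (even: the high bits carry a docStartFP delta; odd: only the singleton docID moved), and the singleton delta is zigzag-encoded so that small negative deltas still encode into few bytes. A sketch of what I understand BitUtil.zigZagDecode to compute:

public class ZigZag {
  // zigzag maps ..., -2, -1, 0, 1, 2, ... to ..., 3, 1, 0, 2, 4, ... so small
  // magnitudes take few vlong bytes regardless of sign
  static long zigZagDecode(long l) {
    return (l >>> 1) ^ -(l & 1);
  }

  public static void main(String[] args) {
    System.out.println(zigZagDecode(0)); // 0
    System.out.println(zigZagDecode(1)); // -1
    System.out.println(zigZagDecode(2)); // 1
    System.out.println(zigZagDecode(3)); // -2
  }
}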

Actually, ste holds a reference to the term:

main[2] dump ste.term.ref.bytes
 ste.term.ref.bytes = {
97, 109, 0, 0, 0, 0, 0, 0
}
main[2] where
  [2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.decodeMetaData (SegmentTermsEnumFrame.java:476)
  [3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.termState (SegmentTermsEnum.java:1,178)
  [4] org.apache.lucene.index.TermStates.build (TermStates.java:104)
  [5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
  [6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
  [7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
  [8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

ste.in describes the file being read:

 ste.in = {
    $assertionsDisabled: true
    org.apache.lucene.store.ByteBufferIndexInput.EMPTY_FLOATBUFFER: instance of java.nio.HeapFloatBuffer(id=1473)
    org.apache.lucene.store.ByteBufferIndexInput.EMPTY_LONGBUFFER: instance of java.nio.HeapLongBuffer(id=1474)
    org.apache.lucene.store.ByteBufferIndexInput.EMPTY_INTBUFFER: instance of java.nio.HeapIntBuffer(id=1475)
    org.apache.lucene.store.ByteBufferIndexInput.length: 1993
    org.apache.lucene.store.ByteBufferIndexInput.chunkSizeMask: 1073741823
    org.apache.lucene.store.ByteBufferIndexInput.chunkSizePower: 30
    org.apache.lucene.store.ByteBufferIndexInput.guard: instance of org.apache.lucene.store.ByteBufferGuard(id=1476)
    org.apache.lucene.store.ByteBufferIndexInput.buffers: instance of java.nio.ByteBuffer[1] (id=1477)
    org.apache.lucene.store.ByteBufferIndexInput.curBufIndex: 0
    org.apache.lucene.store.ByteBufferIndexInput.curBuf: instance of java.nio.DirectByteBufferR(id=1479)
    org.apache.lucene.store.ByteBufferIndexInput.curLongBufferViews: null
    org.apache.lucene.store.ByteBufferIndexInput.curIntBufferViews: null
    org.apache.lucene.store.ByteBufferIndexInput.curFloatBufferViews: null
    org.apache.lucene.store.ByteBufferIndexInput.isClone: true
    org.apache.lucene.store.ByteBufferIndexInput.$assertionsDisabled: true
    org.apache.lucene.store.IndexInput.resourceDescription: "MMapIndexInput(path="/home/dai/index/_7.cfs") [slice=_7_Lucene90_0.tim]"
}

Related code:

  public void nextLeaf() {
    // if (DEBUG) System.out.println("  frame.next ord=" + ord + " nextEnt=" + nextEnt + "
    // entCount=" + entCount);
    assert nextEnt != -1 && nextEnt < entCount
        : "nextEnt=" + nextEnt + " entCount=" + entCount + " fp=" + fp;
    nextEnt++;
    suffix = suffixLengthsReader.readVInt();
    startBytePos = suffixesReader.getPosition();
    ste.term.setLength(prefix + suffix);
    ste.term.grow(ste.term.length());
    suffixesReader.readBytes(ste.term.bytes(), prefix, suffix);
    ste.termExists = true;
  }

  public boolean nextNonLeaf() throws IOException {
    // if (DEBUG) System.out.println("  stef.next ord=" + ord + " nextEnt=" + nextEnt + " entCount="
    // + entCount + " fp=" + suffixesReader.getPosition());
    while (true) {
      if (nextEnt == entCount) {
        assert arc == null || (isFloor && isLastInFloor == false)
            : "isFloor=" + isFloor + " isLastInFloor=" + isLastInFloor;
        loadNextFloorBlock();
        if (isLeafBlock) {
          nextLeaf();
          return false;
        } else {
          continue;
        }
      }

      assert nextEnt != -1 && nextEnt < entCount
          : "nextEnt=" + nextEnt + " entCount=" + entCount + " fp=" + fp;
      nextEnt++;
      final int code = suffixLengthsReader.readVInt();
      suffix = code >>> 1;
      startBytePos = suffixesReader.getPosition();
      ste.term.setLength(prefix + suffix);
      ste.term.grow(ste.term.length());
      suffixesReader.readBytes(ste.term.bytes(), prefix, suffix);   // is this the most crucial part?
      if ((code & 1) == 0) {
        // A normal term
        ste.termExists = true;
        subCode = 0;
        state.termBlockOrd++;
        return false;
      } else {
        // A sub-block; make sub-FP absolute:
        ste.termExists = false;
        subCode = suffixLengthsReader.readVLong();
        lastSubFP = fp - subCode;
        // if (DEBUG) {
        // System.out.println("    lastSubFP=" + lastSubFP);
        // }
        return true;
      }
    }
  }

It looks like this is where the term's position information in the file gets read:

main[1] where
  [1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.scanToTermLeaf (SegmentTermsEnumFrame.java:593)
  [2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.scanToTerm (SegmentTermsEnumFrame.java:530)
  [3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact (SegmentTermsEnum.java:538)
  [4] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
  [5] org.apache.lucene.index.TermStates.build (TermStates.java:102)
  [6] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
  [7] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
  [8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
  [9] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [11] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [12] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1] dump suffixBytes
 suffixBytes = {
97, 109, 97, 110, 100, 98, 117, 116, 99, 97, 110, 100, 111, 104, 101, 108, 108, 111, 104, 105, 105, 105, 115, 105, 116, 107, 110, 111, 119, 109, 97, 121, 109, 111, 110, 103, 111, 110, 111, 116, 116, 114, 121, 119, 104, 97, 116, 119, 111, 114, 108, 100, 121, 111, 117, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
}
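
suffixBytes is the 18 term suffixes of this block concatenated: amandbutcandohellohiiisitknowmaymongonottrywhatworldyou. The per-term lengths come from suffixLengthsReader; reading them off the .tim hexdump further below, they should be 2,3,3,3,2,5,2,1,2,2,4,3,5,3,3,4,5,3 (my own hand decode, so treat it as a sketch). Reassembling the terms:

public class SplitSuffixes {
  public static void main(String[] args) {
    String suffixes = "amandbutcandohellohiiisitknowmaymongonottrywhatworldyou";
    // per-term suffix lengths as stored in the block (the shared prefix length is 0 here)
    int[] lens = {2, 3, 3, 3, 2, 5, 2, 1, 2, 2, 4, 3, 5, 3, 3, 4, 5, 3};
    int pos = 0;
    for (int len : lens) {
      System.out.println(suffixes.substring(pos, pos + len));
      pos += len;
    }
    // prints: am and but can do hello hi i is it know may mongo not try what world you
  }
}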

Where the docFreq statistics for a term are read:

main[1] list
451          // postings
452    
453          // TODO: if docFreq were bulk decoded we could
454          // just skipN here:
455 =>       if (statsSingletonRunLength > 0) {
456            state.docFreq = 1;
457            state.totalTermFreq = 1;
458            statsSingletonRunLength--;
459          } else {
460            int token = statsReader.readVInt();
main[1] print statsSingletonRunLength
 statsSingletonRunLength = 0
main[1] next
> 
Step completed: "thread=main", org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.decodeMetaData(), line=460 bci=80
460            int token = statsReader.readVInt();

main[1] list
456            state.docFreq = 1;
457            state.totalTermFreq = 1;
458            statsSingletonRunLength--;
459          } else {
460 =>         int token = statsReader.readVInt();
461            if ((token & 1) == 1) {
462              state.docFreq = 1;
463              state.totalTermFreq = 1;
464              statsSingletonRunLength = token >>> 1;
465            } else {
main[1] print statsReader
 statsReader = "org.apache.lucene.store.ByteArrayDataInput@6b67034"
main[1] dump statsReader
 statsReader = {
    bytes: instance of byte[64] (id=1520)
    pos: 0
    limit: 16
}
main[1] dump statsReader.bytes
 statsReader.bytes = {
4, 0, 9, 2, 1, 4, 0, 3, 2, 1, 1, 2, 1, 7, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
}
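
With the decodeMetaData branch above we can decode these 16 stats bytes by hand: an even token means docFreq = token >>> 1 followed by a VLong holding totalTermFreq - docFreq, while an odd token marks a singleton term and says the next token >>> 1 terms are singletons too. A sketch, assuming the field indexes freqs and that every value here fits in a single vint byte:

public class DecodeStats {
  public static void main(String[] args) {
    int[] bytes = {4, 0, 9, 2, 1, 4, 0, 3, 2, 1, 1, 2, 1, 7, 2, 2};
    int pos = 0, term = 0, singletonRun = 0;
    while (pos < bytes.length || singletonRun > 0) {
      int docFreq;
      long totalTermFreq;
      if (singletonRun > 0) {
        docFreq = 1;
        totalTermFreq = 1;
        singletonRun--;
      } else {
        int token = bytes[pos++];          // every value here fits one vint byte
        if ((token & 1) == 1) {
          docFreq = 1;
          totalTermFreq = 1;
          singletonRun = token >>> 1;      // this many following terms are singletons too
        } else {
          docFreq = token >>> 1;
          totalTermFreq = docFreq + bytes[pos++]; // vlong delta over docFreq
        }
      }
      System.out.println("term #" + (term++) + " docFreq=" + docFreq + " ttf=" + totalTermFreq);
    }
    // term #0 ("am") decodes to docFreq=2 ttf=2, matching the docFreq=2 seen later in jdb
  }
}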

The term "am" we searched for corresponds to these bytes in the .tim file:

00000000  3f d7 6c 17 12 42 6c 6f  63 6b 54 72 65 65 54 65  |?.l..BlockTreeTe|
00000010  72 6d 73 44 69 63 74 00  00 00 00 fe ea 80 e6 45  |rmsDict........E|
00000020  20 d8 56 64 1b 1b 1b 89  70 fe 67 0a 4c 75 63 65  | .Vd....p.g.Luce|
00000030  6e 65 39 30 5f 30 25 bc  03 61 6d 61 6e 64 62 75  |ne90_0%..amandbu|
00000040  74 63 61 6e 64 6f 68 65  6c 6c 6f 68 69 69 69 73  |tcandohellohiiis|
00000050  69 74 6b 6e 6f 77 6d 61  79 6d 6f 6e 67 6f 6e 6f  |itknowmaymongono|
00000060  74 74 72 79 77 68 61 74  77 6f 72 6c 64 79 6f 75  |ttrywhatworldyou|
00000070  24 02 03 03 03 02 05 02  01 02 02 04 03 05 03 03  |$...............|
00000080  04 05 03 10 04 00 09 02  01 04 00 03 02 01 01 02  |................|   <----  the stats section starts at the 4th byte of this line: 0x10 (=16) is the length, then the 16 stats bytes
00000090  01 07 02 02 26 7a 3d 04  01 02 03 01 01 01 01 01  |....&z=.........|
000000a0  05 01 01 01 00 02 04 00  02 01 01 01 01 01 02 01  |................|
000000b0  01 01 02 01 01 01 01 05  01 03 01 05 a4 03 2f 68  |............../h|
000000c0  6f 6d 65 2f 75 62 75 6e  74 75 2f 64 6f 63 2f 68  |ome/ubuntu/doc/h|
000000d0  65 6c 6c 6f 2e 74 78 74  2f 68 6f 6d 65 2f 75 62  |ello.txt/home/ub|
000000e0  75 6e 74 75 2f 64 6f 63  2f 6d 6f 6e 67 6f 2e 74  |untu/doc/mongo.t|
000000f0  78 74 05 1a 01 03 04 82  01 01 03 c0 28 93 e8 00  |xt..........(...|
00000100  00 00 00 00 00 00 00 da  02 a3 a3                 |...........|

So where does docFreq get assigned?

 currentFrame.state.docFreq = 2
main[1] list
1,113        assert !eof;
1,114        // if (DEBUG) System.out.println("BTR.docFreq");
1,115        currentFrame.decodeMetaData();
1,116        // if (DEBUG) System.out.println("  return " + currentFrame.state.docFreq);
1,117 =>     return currentFrame.state.docFreq;
1,118      }
1,119    
1,120      @Override
1,121      public long totalTermFreq() throws IOException {
1,122        assert !eof;
main[1] where
  [1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.docFreq (SegmentTermsEnum.java:1,117)
  [2] org.apache.lucene.index.TermStates.build (TermStates.java:107)
  [3] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
  [4] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
  [5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
  [6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
  [7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
  [8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
  [9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The read call path:

readByte:110, ByteBufferIndexInput (org.apache.lucene.store)
readVInt:121, DataInput (org.apache.lucene.store)
readVIntBlock:149, Lucene90PostingsReader (org.apache.lucene.codecs.lucene90)
refillDocs:472, Lucene90PostingsReader$BlockDocsEnum (org.apache.lucene.codecs.lucene90)
advance:538, Lucene90PostingsReader$BlockDocsEnum (org.apache.lucene.codecs.lucene90)
advance:77, SlowImpactsEnum (org.apache.lucene.index)
advance:128, ImpactsDISI (org.apache.lucene.search)
nextDoc:133, ImpactsDISI (org.apache.lucene.search)
scoreAll:301, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:247, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:38, BulkScorer (org.apache.lucene.search)
search:776, IndexSearcher (org.apache.lucene.search)
search:694, IndexSearcher (org.apache.lucene.search)
search:688, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
doPagingSearch:161, SearchFiles (com.dinosaur.lucene.skiptest)
queryTest:52, QueryTest (com.dinosaur.lucene.demo)
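
The readVInt at the bottom of this stack is what pulls the variable-length integers apart; a minimal sketch of its semantics:

public class VIntReader {
  // a sketch of DataInput.readVInt: 7 payload bits per byte, least-significant
  // group first; a set high bit means another byte follows
  static int readVInt(byte[] buf, int pos) {
    byte b = buf[pos++];
    int value = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = buf[pos++];
      value |= (b & 0x7F) << shift;
    }
    return value;
  }

  public static void main(String[] args) {
    // the bytes bc 03 from the .tim dump above decode to 444
    // (loadBlock reads them as a VLong, but the encoding is the same)
    System.out.println(readVInt(new byte[] {(byte) 0xbc, 0x03}, 0)); // 444
  }
}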

Where is the .tim file loaded and initialized?

  void loadBlock() throws IOException {

    // Clone the IndexInput lazily, so that consumers
    // that just pull a TermsEnum to
    // seekExact(TermState) don't pay this cost:
    ste.initIndexInput();

    if (nextEnt != -1) {
      // Already loaded
      return;
    }
    // System.out.println("blc=" + blockLoadCount);

    ste.in.seek(fp);
    int code = ste.in.readVInt();
    entCount = code >>> 1;
    assert entCount > 0;
    isLastInFloor = (code & 1) != 0;

    assert arc == null || (isLastInFloor || isFloor)
        : "fp=" + fp + " arc=" + arc + " isFloor=" + isFloor + " isLastInFloor=" + isLastInFloor;

    // TODO: if suffixes were stored in random-access
    // array structure, then we could do binary search
    // instead of linear scan to find target term; eg
    // we could have simple array of offsets

    final long startSuffixFP = ste.in.getFilePointer();
    // term suffixes:
    final long codeL = ste.in.readVLong();
    isLeafBlock = (codeL & 0x04) != 0;
    final int numSuffixBytes = (int) (codeL >>> 3);
    if (suffixBytes.length < numSuffixBytes) {
      suffixBytes = new byte[ArrayUtil.oversize(numSuffixBytes, 1)];
    }
    try {
      compressionAlg = CompressionAlgorithm.byCode((int) codeL & 0x03);
    } catch (IllegalArgumentException e) {
      throw new CorruptIndexException(e.getMessage(), ste.in, e);
    }
    compressionAlg.read(ste.in, suffixBytes, numSuffixBytes);
    suffixesReader.reset(suffixBytes, 0, numSuffixBytes);

    int numSuffixLengthBytes = ste.in.readVInt();
    final boolean allEqual = (numSuffixLengthBytes & 0x01) != 0;
    numSuffixLengthBytes >>>= 1;
    if (suffixLengthBytes.length < numSuffixLengthBytes) {
      suffixLengthBytes = new byte[ArrayUtil.oversize(numSuffixLengthBytes, 1)];
    }
    if (allEqual) {
      Arrays.fill(suffixLengthBytes, 0, numSuffixLengthBytes, ste.in.readByte());
    } else {
      ste.in.readBytes(suffixLengthBytes, 0, numSuffixLengthBytes);
    }
    suffixLengthsReader.reset(suffixLengthBytes, 0, numSuffixLengthBytes);
    totalSuffixBytes = ste.in.getFilePointer() - startSuffixFP;

    /*if (DEBUG) {
    if (arc == null) {
    System.out.println("    loadBlock (next) fp=" + fp + " entCount=" + entCount + " prefixLen=" + prefix + " isLastInFloor=" + isLastInFloor + " leaf?=" + isLeafBlock);
    } else {
    System.out.println("    loadBlock (seek) fp=" + fp + " entCount=" + entCount + " prefixLen=" + prefix + " hasTerms?=" + hasTerms + " isFloor?=" + isFloor + " isLastInFloor=" + isLastInFloor + " leaf?=" + isLeafBlock);
    }
    }*/

    // stats
    int numBytes = ste.in.readVInt();
    if (statBytes.length < numBytes) {
      statBytes = new byte[ArrayUtil.oversize(numBytes, 1)];
    }
    ste.in.readBytes(statBytes, 0, numBytes);
    statsReader.reset(statBytes, 0, numBytes);
    statsSingletonRunLength = 0;
    metaDataUpto = 0;

    state.termBlockOrd = 0;
    nextEnt = 0;
    lastSubFP = -1;

    // TODO: we could skip this if !hasTerms; but
    // that's rare so won't help much
    // metadata
    numBytes = ste.in.readVInt();
    if (bytes.length < numBytes) {
      bytes = new byte[ArrayUtil.oversize(numBytes, 1)];
    }
    ste.in.readBytes(bytes, 0, numBytes);
    bytesReader.reset(bytes, 0, numBytes);

    // Sub-blocks of a single floor block are always
    // written one after another -- tail recurse:
    fpEnd = ste.in.getFilePointer();
    // if (DEBUG) {
    //   System.out.println("      fpEnd=" + fpEnd);
    // }
  }
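
If we line loadBlock() up against the .tim hexdump above, the block holding our 18 terms seems to start right after the "Lucene90_0" codec suffix, at the 0x25 byte. My hand decode of the header (a sketch, not verified against the index writer):

public class TimBlockHeader {
  public static void main(String[] args) {
    // vint 0x25 = 37: entry count plus the "last in floor" flag
    int code = 0x25;
    System.out.println("entCount       = " + (code >>> 1));          // 18 terms
    System.out.println("isLastInFloor  = " + ((code & 1) != 0));     // true

    // vlong bytes bc 03 -> 444: suffix-block length, leaf flag, compression
    long codeL = (0xbc & 0x7F) | (0x03L << 7);
    System.out.println("isLeafBlock    = " + ((codeL & 0x04) != 0)); // true
    System.out.println("numSuffixBytes = " + (codeL >>> 3));         // 55 = len("amand...you")
    System.out.println("compression    = " + (codeL & 0x03));        // 0: none

    // after the 55 suffix bytes: vint 0x24 = 36 -> 18 suffix-length bytes, not all-equal
    int lenCode = 0x24;
    System.out.println("numSuffixLengthBytes = " + (lenCode >>> 1));      // 18
    System.out.println("allEqual             = " + ((lenCode & 1) != 0)); // false

    // then vint 0x10 = 16 stats bytes (decoded earlier) and vint 0x26 = 38 metadata bytes
  }
}

The 55 suffix bytes, the 18 length bytes, the 16 stats bytes and the 38 metadata bytes that follow all line up with the dumps we saw earlier.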

We know Lucene splits its index into multiple files; here we only discuss the inverted-index part.

Lucene calls the files that index Terms the Terms Index; their suffix is .tip. Postings information is stored separately in .doc, .pay, and .pos files: .doc records the Postings' DocIds together with the term frequencies, .pay records the payload information, and .pos records the position information. The Terms Dictionary files use the suffix .tim; the Terms Dictionary is the bridge between a Term and its Postings, storing each Term along with pointers into its Postings files.

In short, the Terms Index (.tip) lets you quickly locate the Term you want inside the Terms Dictionary (.tim), together with its Postings file pointers and the Term's segment-level statistics.


postings: Postings actually contain more than just DocIDs (we usually call that ordered sequence of document numbers the DocIDs); they also include the term frequencies, the positions of the Term within each document, and payload data.

So the inverted index involves at least five kinds of files (.tip, .tim, .doc, .pos, .pay); this post won't cover all of them.
