
Lucene search process

Posted on: June 19, 2023 at 12:58 PM

Background

The goal is to understand how a search runs inside Lucene.

Call stacks

add:473, FSTCompiler (org.apache.lucene.util.fst)
compileIndex:504, Lucene90BlockTreeTermsWriter$PendingBlock (org.apache.lucene.codecs.lucene90.blocktree)
writeBlocks:725, Lucene90BlockTreeTermsWriter$TermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
finish:1105, Lucene90BlockTreeTermsWriter$TermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
write:370, Lucene90BlockTreeTermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
write:172, PerFieldPostingsFormat$FieldsWriter (org.apache.lucene.codecs.perfield)
flush:135, FreqProxTermsWriter (org.apache.lucene.index)
flush:310, IndexingChain (org.apache.lucene.index)
flush:392, DocumentsWriterPerThread (org.apache.lucene.index)
doFlush:492, DocumentsWriter (org.apache.lucene.index)
flushAllThreads:671, DocumentsWriter (org.apache.lucene.index)
doFlush:4194, IndexWriter (org.apache.lucene.index)
flush:4168, IndexWriter (org.apache.lucene.index)
shutdown:1322, IndexWriter (org.apache.lucene.index)
close:1362, IndexWriter (org.apache.lucene.index)
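doTestSearch:133, FstTest (com.dinosaur.lucene.demo)

The trace above is the write path: IndexWriter.close flushes the segment, and the block tree terms writer compiles the term index into an FST (FSTCompiler.add). The trace below is the read path: a TermQuery seeks the term through that FST (FST.findTargetArc).

findTargetArc:1418, FST (org.apache.lucene.util.fst)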
seekExact:511, SegmentTermsEnum (org.apache.lucene.codecs.lucene90.blocktree)
loadTermsEnum:111, TermStates (org.apache.lucene.index)
build:96, TermStates (org.apache.lucene.index)
createWeight:227, TermQuery (org.apache.lucene.search)
createWeight:904, IndexSearcher (org.apache.lucene.search)
search:687, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
doPagingSearch:158, SearchFiles (com.dinosaur.lucene.demo)
testSearch:128, SearchFiles (com.dinosaur.lucene.demo)

Example

cfs file

$ hexdump  app/index/_3.cfs
000000  3f d7 6c 17 14 4c 75 63 65 6e 65 39 30 43 6f 6d 
000010  70 6f 75 6e 64 44 61 74 61 00 00 00 00 7a fc 30 
000020  52 e0 51 d2 54 be 49 7f 21 78 69 fe c4 00 00 00 
000030  3f d7 6c 17 11 4c 75 63 65 6e 65 39 30 4e 6f 72 
000040  6d 73 44 61 74 61 00 00 00 00 7a fc 30 52 e0 51 
000050  d2 54 be 49 7f 21 78 69 fe c4 00 04 03 c0 28 93 
000060  e8 00 00 00 00 00 00 00 00 f0 6a f4 62 00 00 00 
000070  3f d7 6c 17 16 4c 75 63 65 6e 65 39 30 46 69 65 
000080  6c 64 73 49 6e 64 65 78 49 64 78 00 00 00 00 7a
000090  fc 30 52 e0 51 d2 54 be 49 7f 21 78 69 fe c4 00
0000a0  c0 28 93 e8 00 00 00 00 00 00 00 00 92 7f 21 bb
0000b0  3f d7 6c 17 19 4c 75 63 65 6e 65 39 30 50 6f 69
0000c0  6e 74 73 46 6f 72 6d 61 74 49 6e 64 65 78 00 00
0000d0  00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21 78 69
0000e0  fe c4 00 32 c0 28 93 e8 00 00 00 00 00 00 00 00
0000f0  f7 61 6e 2f 00 00 00 00 3f d7 6c 17 13 42 6c 6f
000100  63 6b 54 72 65 65 54 65 72 6d 73 49 6e 64 65 78
000110  00 00 00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21
000120  78 69 fe c4 0a 4c 75 63 65 6e 65 39 30 5f 30 00
000130  00 c0 28 93 e8 00 00 00 00 00 00 00 00 07 1a 7b
000140  47 00 00 00 00 00 00 00 3f d7 6c 17 18 4c 75 63
000150  65 6e 65 39 30 50 6f 69 6e 74 73 46 6f 72 6d 61
000160  74 44 61 74 61 00 00 00 00 7a fc 30 52 e0 51 d2
000170  54 be 49 7f 21 78 69 fe c4 00 02 fe 00 08 80 00
000180  01 88 d2 0f 28 0d ff c0 28 93 e8 00 00 00 00 00
000190  00 00 00 6d 43 fa 6e 00 3f d7 6c 17 19 4c 75 63
0001a0  65 6e 65 39 30 50 6f 73 74 69 6e 67 73 57 72 69
0001b0  74 65 72 44 6f 63 00 00 00 00 7a fc 30 52 e0 51
0001c0  d2 54 be 49 7f 21 78 69 fe c4 0a 4c 75 63 65 6e
0001d0  65 39 30 5f 30 01 03 01 03 c0 28 93 e8 00 00 00   <---   the right-hand 01 03 are the two docIDs of the term "you"
0001e0  00 00 00 00 00 26 f5 75 88 00 00 00 00 00 00 00
0001f0  3f d7 6c 17 19 4c 75 63 65 6e 65 39 30 50 6f 73
000200  74 69 6e 67 73 57 72 69 74 65 72 50 6f 73 00 00
000210  00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21 78 69
000220  fe c4 0a 4c 75 63 65 6e 65 39 30 5f 30 02 00 00
000230  01 02 03 01 c0 28 93 e8 00 00 00 00 00 00 00 00
000240  c5 ac 32 b6 00 00 00 00 3f d7 6c 17 15 4c 75 63 
000250  65 6e 65 39 30 4e 6f 72 6d 73 4d 65 74 61 64 61
000260  74 61 00 00 00 00 7a fc 30 52 e0 51 d2 54 be 49
000270  7f 21 78 69 fe c4 00 02 00 00 00 ff ff ff ff ff
000280  ff ff ff 00 00 00 00 00 00 00 00 ff ff ff 02 00
000290  00 00 01 2b 00 00 00 00 00 00 00 ff ff ff ff c0
0002a0  28 93 e8 00 00 00 00 00 00 00 00 1c 85 f4 99 00
0002b0  3f d7 6c 17 1c 4c 75 63 65 6e 65 39 30 53 74 6f
0002c0  72 65 64 46 69 65 6c 64 73 46 61 73 74 44 61 74
0002d0  61 00 00 00 01 7a fc 30 52 e0 51 d2 54 be 49 7f
0002e0  21 78 69 fe c4 00 00 0a 00 01 08 12 13 01 04 02
0002f0  05 05 05 05 05 05 05 05 05 10 00 40 10 2e 2e 5c
000300  40 64 6f 63 73 40 5c 64 65 6d 40 6f 2e 74 78 40
000310  74 00 11 2e 40 2e 5c 64 6f 40 63 73 5c 64 40 65
000320  6d 6f 32 40 2e 74 78 74 c0 28 93 e8 00 00 00 00
000330  00 00 00 00 81 b0 7e 09 3f d7 6c 17 18 4c 75 63
000340  65 6e 65 39 30 50 6f 69 6e 74 73 46 6f 72 6d 61
000350  74 4d 65 74 61 00 00 00 00 7a fc 30 52 e0 51 d2
000360  54 be 49 7f 21 78 69 fe c4 00 01 00 00 00 3f d7
000370  6c 17 03 42 4b 44 00 00 00 09 01 01 80 04 08 01
000380  80 00 01 88 d2 0f 28 0d 80 00 01 88 d2 0f 28 0d
000390  02 02 01 32 00 00 00 00 00 00 00 33 00 00 00 00
0003a0  00 00 00 ff ff ff ff 44 00 00 00 00 00 00 00 4f 
0003b0  00 00 00 00 00 00 00 c0 28 93 e8 00 00 00 00 00
0003c0  00 00 00 02 3e 97 d6 00 3f d7 6c 17 17 4c 75 63
0003d0  65 6e 65 39 30 46 69 65 6c 64 73 49 6e 64 65 78
0003e0  4d 65 74 61 00 00 00 01 7a fc 30 52 e0 51 d2 54
0003f0  be 49 7f 21 78 69 fe c4 00 80 80 05 02 00 00 00
000400  0a 00 00 00 02 00 00 00 30 00 00 00 00 00 00 00
000410  00 00 00 00 00 00 00 00 00 00 00 40 00 00 00 00
000420  00 00 00 00 00 30 00 00 00 00 00 00 00 36 00 00
000430  00 00 00 00 00 00 00 84 42 00 00 00 00 00 00 00
000440  00 00 30 00 00 00 00 00 00 00 78 00 00 00 00 00
000450  00 00 01 01 02 c0 28 93 e8 00 00 00 00 00 00 00
000460  00 c3 23 d0 d6 00 00 00 3f d7 6c 17 12 42 6c 6f    <-------  3f
000470  63 6b 54 72 65 65 54 65 72 6d 73 44 69 63 74 00
000480  00 00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21 78
000490  69 fe c4 0a 4c 75 63 65 6e 65 39 30 5f 30 0b 9c    <--------
0004a0  01 61 72 65 68 6f 77 6f 6c 64 73 74 75 64 65 6e
0004b0  74 79 6f 75 0a 03 03 03 07 03 05 04 00 05 04 00     <------  05 04 00 05 04 is the position data
0004c0  0b 7a 3d 04 00 02 01 01 05 01 00 01 05 8c 02 2e      <------- 7a 3d 04 carries a lot of position information
0004d0  2e 5c 64 6f 63 73 5c 64 65 6d 6f 2e 74 78 74 2e
0004e0  2e 5c 64 6f 63 73 5c 64 65 6d 6f 32 2e 74 78 74
0004f0  04 10 11 01 03 04 82 01 00 05 c0 28 93 e8 00 00
000500  00 00 00 00 00 00 1a 7f dc 45 00 00 00 00 00 00
000510  3f d7 6c 17 12 42 6c 6f 63 6b 54 72 65 65 54 65
000520  72 6d 73 4d 65 74 61 00 00 00 00 7a fc 30 52 e0
000530  51 d2 54 be 49 7f 21 78 69 fe c4 0a 4c 75 63 65
000540  6e 65 39 30 5f 30 3f d7 6c 17 1b 4c 75 63 65 6e 
000550  65 39 30 50 6f 73 74 69 6e 67 73 57 72 69 74 65
000560  72 54 65 72 6d 73 00 00 00 00 7a fc 30 52 e0 51
000570  d2 54 be 49 7f 21 78 69 fe c4 0a 4c 75 63 65 6e
000580  65 39 30 5f 30 80 01 02 02 05 02 da 01 07 07 02
000590  03 61 72 65 03 79 6f 75 37 3f d7 6c 17 03 46 53
0005a0  54 00 00 00 08 01 03 01 da 02 00 00 01 00 02 02
0005b0  92 03 02 02 10 2e 2e 5c 64 6f 63 73 5c 64 65 6d
0005c0  6f 2e 74 78 74 11 2e 2e 5c 64 6f 63 73 5c 64 65
0005d0  6d 6f 32 2e 74 78 74 38 3f d7 6c 17 03 46 53 54
0005e0  00 00 00 08 01 03 03 92 02 00 00 01 49 00 00 00
0005f0  00 00 00 00 a2 00 00 00 00 00 00 00 c0 28 93 e8
000600  00 00 00 00 00 00 00 00 c9 44 df a8 00 00 00 00
000610  3f d7 6c 17 12 4c 75 63 65 6e 65 39 34 46 69 65
000620  6c 64 49 6e 66 6f 73 00 00 00 00 7a fc 30 52 e0
000630  51 d2 54 be 49 7f 21 78 69 fe c4 00 03 04 70 61
000640  74 68 00 02 01 00 ff ff ff ff ff ff ff ff 02 1d
000650  50 65 72 46 69 65 6c 64 50 6f 73 74 69 6e 67 73
000660  46 6f 72 6d 61 74 2e 66 6f 72 6d 61 74 08 4c 75
000670  63 65 6e 65 39 30 1d 50 65 72 46 69 65 6c 64 50
000680  6f 73 74 69 6e 67 73 46 6f 72 6d 61 74 2e 73 75
000690  66 66 69 78 01 30 00 00 01 00 08 6d 6f 64 69 66
0006a0  69 65 64 01 00 00 00 ff ff ff ff ff ff ff ff 00
0006b0  01 01 08 00 01 00 08 63 6f 6e 74 65 6e 74 73 02
0006c0  00 03 00 ff ff ff ff ff ff ff ff 02 1d 50 65 72
0006d0  46 69 65 6c 64 50 6f 73 74 69 6e 67 73 46 6f 72
0006e0  6d 61 74 2e 66 6f 72 6d 61 74 08 4c 75 63 65 6e
0006f0  65 39 30 1d 50 65 72 46 69 65 6c 64 50 6f 73 74
000700  69 6e 67 73 46 6f 72 6d 61 74 2e 73 75 66 66 69
000710  78 01 30 00 00 01 00 c0 28 93 e8 00 00 00 00 00
000720  00 00 00 36 55 24 d2 c0 28 93 e8 00 00 00 00 00
000730  00 00 00 41 6a 49 d4
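
Every sub-file packed into the dump above opens with the same four bytes, 3f d7 6c 17, followed by a length-prefixed codec name. The sketch below scans a byte array for those headers and reports where each one starts. The class `CodecHeaderScan` and its methods are hypothetical helpers rather than Lucene API, and the sketch assumes the name length fits in a single byte (names shorter than 128 bytes).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CodecHeaderScan {
  // The four magic bytes that open every codec header in the dump.
  static final byte[] MAGIC = {0x3f, (byte) 0xd7, 0x6c, 0x17};

  /** Returns "offset:codecName" for each header found in data. */
  public static List<String> findHeaders(byte[] data) {
    List<String> found = new ArrayList<>();
    for (int i = 0; i + MAGIC.length < data.length; i++) {
      if (!startsWithMagic(data, i)) continue;
      int nameLen = data[i + 4] & 0xff; // assumption: single-byte length (< 128)
      int nameStart = i + 5;
      if (nameLen >= 128 || nameStart + nameLen > data.length) continue;
      String name = new String(data, nameStart, nameLen, StandardCharsets.US_ASCII);
      found.add(i + ":" + name);
      i = nameStart + nameLen - 1; // resume scanning right after the name
    }
    return found;
  }

  static boolean startsWithMagic(byte[] data, int at) {
    for (int k = 0; k < MAGIC.length; k++) {
      if (data[at + k] != MAGIC[k]) return false;
    }
    return true;
  }

  /** Parses a hexdump-style string such as "3f d7 6c 17" into bytes. */
  public static byte[] hex(String s) {
    String[] parts = s.trim().split("\\s+");
    byte[] out = new byte[parts.length];
    for (int i = 0; i < parts.length; i++) {
      out[i] = (byte) Integer.parseInt(parts[i], 16);
    }
    return out;
  }

  public static void main(String[] args) {
    // Bytes copied from offset 0x460 of the dump above.
    byte[] sample = hex("00 c3 23 d0 d6 00 00 00 3f d7 6c 17 12 42 6c 6f "
        + "63 6b 54 72 65 65 54 65 72 6d 73 44 69 63 74 00");
    System.out.println(findHeaders(sample)); // prints [8:BlockTreeTermsDict]
  }
}
```

Index 8 in this sample corresponds to file offset 0x460 + 8 = 0x468, which is 1128 in decimal.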

The tim file starts at offset=1128 (0x468 in the dump above) within the cfs file.

score:250, BM25Similarity$BM25Scorer (org.apache.lucene.search.similarities)
score:60, LeafSimScorer (org.apache.lucene.search)
score:75, TermScorer (org.apache.lucene.search)
collect:73, TopScoreDocCollector$SimpleTopScoreDocCollector$1 (org.apache.lucene.search)
scoreAll:305, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:247, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:38, BulkScorer (org.apache.lucene.search)
search:776, IndexSearcher (org.apache.lucene.search)
search:694, IndexSearcher (org.apache.lucene.search)
search:688, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
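doPagingSearch:161, SearchFiles (com.dinosaur.lucene.skiptest)

The trace above scores each matching document with BM25. The trace below then loads the stored fields of a hit through IndexSearcher.doc.

readField:248, Lucene90CompressingStoredFieldsReader (org.apache.lucene.codecs.lucene90.compressing)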
document:642, Lucene90CompressingStoredFieldsReader (org.apache.lucene.codecs.lucene90.compressing)
document:253, SegmentReader (org.apache.lucene.index)
document:171, BaseCompositeReader (org.apache.lucene.index)
document:411, IndexReader (org.apache.lucene.index)
doc:390, IndexSearcher (org.apache.lucene.search)
doPagingSearch:195, SearchFiles (com.dinosaur.lucene.skiptest)

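Relationship between tim, tip, and doc

tip holds the pointer that leads to a term. tim holds the term's statistics. doc holds the docIDs that the term maps to.

In other words, a lookup goes tip -> tim -> doc.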
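
The chain can be pictured with a toy in-memory model. This is only a conceptual sketch: the class `ToyTermLookup`, its maps, and the sample docIDs are invented for illustration and do not mirror Lucene's on-disk layout (in Lucene the tip is an FST and the tim stores terms in blocks).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ToyTermLookup {
  /** Stand-in for one tim entry: term statistics plus a pointer into doc. */
  static final class TermEntry {
    final int docFreq;
    final int docStart; // hypothetical pointer into the doc array

    TermEntry(int docFreq, int docStart) {
      this.docFreq = docFreq;
      this.docStart = docStart;
    }
  }

  // tip: term -> pointer into tim (a plain map here, an FST in Lucene)
  private final Map<String, Integer> tip = new HashMap<>();
  // tim: pointer -> term statistics
  private final Map<Integer, TermEntry> tim = new HashMap<>();
  // doc: flat array of docIDs; each term owns docFreq consecutive slots
  private int[] doc = new int[0];

  public void add(String term, int... docIds) {
    int timPointer = tim.size();
    int docStart = doc.length;
    doc = Arrays.copyOf(doc, doc.length + docIds.length);
    System.arraycopy(docIds, 0, doc, docStart, docIds.length);
    tim.put(timPointer, new TermEntry(docIds.length, docStart));
    tip.put(term, timPointer);
  }

  public int[] lookup(String term) {
    Integer timPointer = tip.get(term); // step 1: tip finds the term
    if (timPointer == null) {
      return new int[0];
    }
    TermEntry entry = tim.get(timPointer); // step 2: tim gives the statistics
    // step 3: doc gives the docIDs
    return Arrays.copyOfRange(doc, entry.docStart, entry.docStart + entry.docFreq);
  }

  public static void main(String[] args) {
    ToyTermLookup index = new ToyTermLookup();
    index.add("are", 0, 1);
    index.add("you", 0, 1);
    index.add("student", 1);
    System.out.println(Arrays.toString(index.lookup("you"))); // prints [0, 1]
  }
}
```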

doc file

<init>:74, Lucene90PostingsReader (org.apache.lucene.codecs.lucene90)
fieldsProducer:424, Lucene90PostingsFormat (org.apache.lucene.codecs.lucene90)
<init>:330, PerFieldPostingsFormat$FieldsReader (org.apache.lucene.codecs.perfield)
fieldsProducer:392, PerFieldPostingsFormat (org.apache.lucene.codecs.perfield)
<init>:118, SegmentCoreReaders (org.apache.lucene.index)
<init>:92, SegmentReader (org.apache.lucene.index)
doBody:94, StandardDirectoryReader$1 (org.apache.lucene.index)
doBody:77, StandardDirectoryReader$1 (org.apache.lucene.index)
run:816, SegmentInfos$FindSegmentsFile (org.apache.lucene.index)
open:109, StandardDirectoryReader (org.apache.lucene.index)
open:67, StandardDirectoryReader (org.apache.lucene.index)
open:60, DirectoryReader (org.apache.lucene.index)
doSearchDemo:25, SimpleSearchTest (com.dinosaur.lucene.demo)

how to find the docId list

org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java

  final class BlockDocsEnum extends PostingsEnum {

    ...

    public PostingsEnum reset(IntBlockTermState termState, int flags) throws IOException {
      docFreq = termState.docFreq;
      totalTermFreq = indexHasFreq ? termState.totalTermFreq : docFreq;
      docTermStartFP = termState.docStartFP;
      skipOffset = termState.skipOffset;
      singletonDocID = termState.singletonDocID;
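      // A term that occurs in only one document carries its docID in singletonDocID,
      // so the doc file is only needed when docFreq > 1.
      if (docFreq > 1) {
        if (docIn == null) {
          // lazy init
          docIn = startDocIn.clone();
        }
        // Jump to the start of this term's postings in the doc file.
        docIn.seek(docTermStartFP);
      }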

      doc = -1;
      this.needsFreq = PostingsEnum.featureRequested(flags, PostingsEnum.FREQS);
      this.isFreqsRead = true;
      if (indexHasFreq == false || needsFreq == false) {
        for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
          freqBuffer[i] = 1;
        }
      }
      accum = 0;
      blockUpto = 0;
      nextSkipDoc = BLOCK_SIZE - 1; // we won't skip if target is found in first block
      docBufferUpto = BLOCK_SIZE;
      skipped = false;
      return this;
    }
  }
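
As a rough illustration of what is read after `docIn.seek(docTermStartFP)`, the sketch below decodes a short postings list such as the `01 03` bytes in the doc section of the dump. It assumes the vInt tail encoding that, as far as I can tell, the Lucene 9.0 postings format uses for lists shorter than one block when frequencies are indexed: each value is `(docDelta << 1) | 1` when the frequency is 1, otherwise `docDelta << 1` followed by the frequency. The class `TailPostingsDecoder` is a hypothetical helper, not Lucene code.

```java
import java.util.ArrayList;
import java.util.List;

public class TailPostingsDecoder {
  private final byte[] data;
  private int pos;

  public TailPostingsDecoder(byte[] data) {
    this.data = data;
  }

  /** Reads a variable-length int: 7 payload bits per byte, low bits first, high bit means "more". */
  int readVInt() {
    int value = 0;
    int shift = 0;
    while (true) {
      byte b = data[pos++];
      value |= (b & 0x7f) << shift;
      if ((b & 0x80) == 0) {
        return value;
      }
      shift += 7;
    }
  }

  /** Decodes docFreq postings and returns {docID, freq} pairs. */
  public List<int[]> decode(int docFreq) {
    List<int[]> postings = new ArrayList<>();
    int docId = 0;
    for (int i = 0; i < docFreq; i++) {
      int code = readVInt();
      docId += code >>> 1; // upper bits hold the delta from the previous docID
      int freq = (code & 1) != 0 ? 1 : readVInt(); // low bit set means freq == 1
      postings.add(new int[] {docId, freq});
    }
    return postings;
  }

  public static void main(String[] args) {
    List<int[]> postings = new TailPostingsDecoder(new byte[] {0x01, 0x03}).decode(2);
    for (int[] p : postings) {
      System.out.println("docID=" + p[0] + " freq=" + p[1]);
    }
  }
}
```

Under that assumption, the bytes 01 03 decode to docIDs 0 and 1, each with frequency 1, which is consistent with the two docIDs annotated for "you" in the dump.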
