背景
FST 即finite state machine
,lucene很多内容都是用这个格式压缩和存储的.
fst 例子
介绍FST
之前,先看看Hashmap
.
HashMap
的语义: key-> value
, 也就是输入一个key
,返回一个value
FST
结构也是一个特别的Map
, 语义和Map
差不多:FST(key)=value
例子
有下面组词汇的数组[cat:5,dog:7,dogs:13]
- key为
cat
,value为5
- key为
dog
,value为7
- key为
dogs
,value为13
最后会被序列化成这个结构
[0, 116, 15, 97, 6, 5, 115, 31, 103, 7, 111, 6, 7, 100, 22, 4, 5, 99, 16]
下面来分析这个例子的每个字节
103, 7 | 111, 6 | 7, 100, 22 | 4, 5, 99, 16 |
---|---|---|---|
flag=7,value:‘g’也就是103,target:7,nextArch=7 | flag=6,value:‘0’也就是111,target:9,nextArch=9 | flag=22 , value=‘d’ 也就是100,output=7,target=11(为什么是11 ?因为7前面就是pos=11,nextArc=11) | flag=16 , value:‘c’也就是99, output =5,target=4,nextArc=14 |
[0, 116, 15,| 97, 6, |5, 115, 31, |103, 7, |111, 6, |7, 100, 22,| 4, 5, 99, 16]
--------| ------- | -----------| ------ | ------- |--------- | -------------
(t,null)| (a,null)| (s,5) | (g,null)| (o,null)| (d:7) | (output =5,target=4,flag=16 value:'c')
常量解释:
常量 | 值 | 描述 |
---|---|---|
BIT_LAST_ARC | 1>>1 | 描述该弧是最后一个弧,类似:二叉树右子节点;或者类似于三叉树的第三个节点 |
BIT_ARC_HAS_OUTPUT | 1>>4 | 有output,也就是这个节点存了一些值 |
BIT_TARGET_NEXT | 1>>2 | 表示该节点的下一个节点就是下一个bit,不需要在另外存了,也就是这个弧的两个节点是存在一起的 |
arc class分析
arc class 源码如下:
public static final class Arc<T> {
// *** Arc fields.
private int label;
private T output;
private long target;
private byte flags;
private T nextFinalOutput;
private long nextArc;
private byte nodeFlags;
// *** Fields for arcs belonging to a node with fixed length arcs.
// So only valid when bytesPerArc != 0.
// nodeFlags == ARCS_FOR_BINARY_SEARCH || nodeFlags == ARCS_FOR_DIRECT_ADDRESSING.
private int bytesPerArc;
private long posArcsStart;
private int arcIdx;
private int numArcs;
// *** Fields for a direct addressing node. nodeFlags == ARCS_FOR_DIRECT_ADDRESSING.
/**
* Start position in the {@link FST.BytesReader} of the presence bits for a direct addressing
* node, aka the bit-table
*/
private long bitTableStart;
/** First label of a direct addressing node. */
private int firstLabel;
/**
* Index of the current label of a direct addressing node. While {@link #arcIdx} is the current
* index in the label range, {@link #presenceIndex} is its corresponding index in the list of
* actually present labels. It is equal to the number of bits set before the bit at {@link
* #arcIdx} in the bit-table. This field is a cache to avoid to count bits set repeatedly when
* iterating the next arcs.
*/
private int presenceIndex;
}
字段 | 描述 |
---|---|
label | 如果用map的key value来举例 , label就是key的一截 , 多个lebel会组成一个key , 举例 “cat” 会拆分成三个label : “c” , “a”, “t” |
output | 如果是用map的key value来举例 , output就是value的一截,多个output会组成一个value |
target | 描述的是下一个节点的偏移量,一个弧度如果是src -> dst 这样结构的话 , target 就是dst 的位置 也就是 arr[target] 就是dst 的节点的位置 |
flags | 各种奇奇怪怪的标志位来标识这个弧的状态,用位图来将各种状态压缩 |
nextFinalOutput | 前面说了,如果这个key value 结构 , 这个描述的是value的最后一截 ,否则就是null |
nextArc | 描述的是多个弧,就像一个多叉树的兄弟节点,这个是描述下一个兄弟节点的偏移位置 |
numArcs | 描述的是这个阶段有多少个弧,也就是这个节点有多少个子节点 |
写入过程:
add:473, FSTCompiler (org.apache.lucene.util.fst)
compileIndex:504, Lucene90BlockTreeTermsWriter$PendingBlock (org.apache.lucene.codecs.lucene90.blocktree)
writeBlocks:725, Lucene90BlockTreeTermsWriter$TermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
finish:1105, Lucene90BlockTreeTermsWriter$TermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
write:370, Lucene90BlockTreeTermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
write:172, PerFieldPostingsFormat$FieldsWriter (org.apache.lucene.codecs.perfield)
flush:135, FreqProxTermsWriter (org.apache.lucene.index)
flush:310, IndexingChain (org.apache.lucene.index)
flush:392, DocumentsWriterPerThread (org.apache.lucene.index)
doFlush:492, DocumentsWriter (org.apache.lucene.index)
flushAllThreads:671, DocumentsWriter (org.apache.lucene.index)
doFlush:4194, IndexWriter (org.apache.lucene.index)
flush:4168, IndexWriter (org.apache.lucene.index)
shutdown:1322, IndexWriter (org.apache.lucene.index)
close:1362, IndexWriter (org.apache.lucene.index)
doTestSearch:133, FstTest (com.dinosaur.lucene.demo)
findTargetArc:1418, FST (org.apache.lucene.util.fst)
seekExact:511, SegmentTermsEnum (org.apache.lucene.codecs.lucene90.blocktree)
loadTermsEnum:111, TermStates (org.apache.lucene.index)
build:96, TermStates (org.apache.lucene.index)
createWeight:227, TermQuery (org.apache.lucene.search)
createWeight:904, IndexSearcher (org.apache.lucene.search)
search:687, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
doPagingSearch:158, SearchFiles (com.dinosaur.lucene.demo)
testSearch:128, SearchFiles (com.dinosaur.lucene.demo)
跳转内容
如何跳转
public void decodeMetaData() throws IOException {
// if (DEBUG) System.out.println("\nBTTR.decodeMetadata seg=" + segment + " mdUpto=" +
// metaDataUpto + " vs termBlockOrd=" + state.termBlockOrd);
// lazily catch up on metadata decode:
final int limit = getTermBlockOrd();
boolean absolute = metaDataUpto == 0;
assert limit > 0;
// TODO: better API would be "jump straight to term=N"???
while (metaDataUpto < limit) {
// TODO: we could make "tiers" of metadata, ie,
// decode docFreq/totalTF but don't decode postings
// metadata; this way caller could get
// docFreq/totalTF w/o paying decode cost for
// postings
// TODO: if docFreq were bulk decoded we could
// just skipN here:
if (statsSingletonRunLength > 0) {
state.docFreq = 1;
state.totalTermFreq = 1;
statsSingletonRunLength--;
} else {
int token = statsReader.readVInt();
if ((token & 1) == 1) {
state.docFreq = 1;
state.totalTermFreq = 1;
statsSingletonRunLength = token >>> 1;
} else {
state.docFreq = token >>> 1;
if (ste.fr.fieldInfo.getIndexOptions() == IndexOptions.DOCS) {
state.totalTermFreq = state.docFreq;
} else {
state.totalTermFreq = state.docFreq + statsReader.readVLong();
}
}
}
// metadata
ste.fr.parent.postingsReader.decodeTerm(bytesReader, ste.fr.fieldInfo, state, absolute);
metaDataUpto++;
absolute = false;
}
state.termBlockOrd = metaDataUpto;
}
相关阅读
- https://www.shenyanchao.cn/blog/2018/12/04/lucene-fst/
- https://blog.csdn.net/wang_hnust/article/details/87950746
- https://blog.51cto.com/sbp810050504/1361551
- https://developer.aliyun.com/article/906165
- https://blog.csdn.net/conansonic/article/details/52091301
- 证明论文
- 知乎相关论文
- https://cs.nyu.edu/~mohri/pub/hwa.pdf