Building Elasticsearch
gradle idea
This took a long time:
BUILD SUCCESSFUL in 49m 34s
334 actionable tasks: 334 executed
Elasticsearch call stacks
prepareRequest:61, RestCatAction (org.elasticsearch.rest.action.cat)
handleRequest:80, BaseRestHandler (org.elasticsearch.rest)
handleRequest:69, SecurityRestFilter (org.elasticsearch.xpack.security.rest)
dispatchRequest:240, RestController (org.elasticsearch.rest)
tryAllHandlers:337, RestController (org.elasticsearch.rest)
dispatchRequest:174, RestController (org.elasticsearch.rest)
dispatchRequest:324, AbstractHttpServerTransport (org.elasticsearch.http)
handleIncomingRequest:374, AbstractHttpServerTransport (org.elasticsearch.http)
incomingRequest:303, AbstractHttpServerTransport (org.elasticsearch.http)
channelRead0:66, Netty4HttpRequestHandler (org.elasticsearch.http.netty4)
channelRead0:31, Netty4HttpRequestHandler (org.elasticsearch.http.netty4)
channelRead:105, SimpleChannelInboundHandler (io.netty.channel)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:58, Netty4HttpPipeliningHandler (org.elasticsearch.http.netty4)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
channelRead:111, MessageToMessageCodec (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:323, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:297, ByteToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:286, IdleStateHandler (io.netty.handler.timeout)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:1434, DefaultChannelPipeline$HeadContext (io.netty.channel)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:965, DefaultChannelPipeline (io.netty.channel)
read:163, AbstractNioByteChannel$NioByteUnsafe (io.netty.channel.nio)
processSelectedKey:644, NioEventLoop (io.netty.channel.nio)
processSelectedKeysPlain:544, NioEventLoop (io.netty.channel.nio)
processSelectedKeys:498, NioEventLoop (io.netty.channel.nio)
run:458, NioEventLoop (io.netty.channel.nio)
run:897, SingleThreadEventExecutor$5 (io.netty.util.concurrent)
run:834, Thread (java.lang)
And a second one:
handleRequest:97, BaseRestHandler (org.elasticsearch.rest)
handleRequest:69, SecurityRestFilter (org.elasticsearch.xpack.security.rest)
dispatchRequest:240, RestController (org.elasticsearch.rest)
tryAllHandlers:337, RestController (org.elasticsearch.rest)
dispatchRequest:174, RestController (org.elasticsearch.rest)
dispatchRequest:324, AbstractHttpServerTransport (org.elasticsearch.http)
handleIncomingRequest:374, AbstractHttpServerTransport (org.elasticsearch.http)
incomingRequest:303, AbstractHttpServerTransport (org.elasticsearch.http)
channelRead0:66, Netty4HttpRequestHandler (org.elasticsearch.http.netty4)
channelRead0:31, Netty4HttpRequestHandler (org.elasticsearch.http.netty4)
channelRead:105, SimpleChannelInboundHandler (io.netty.channel)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:58, Netty4HttpPipeliningHandler (org.elasticsearch.http.netty4)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
channelRead:111, MessageToMessageCodec (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:323, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:297, ByteToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:286, IdleStateHandler (io.netty.handler.timeout)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:1434, DefaultChannelPipeline$HeadContext (io.netty.channel)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:965, DefaultChannelPipeline (io.netty.channel)
read:163, AbstractNioByteChannel$NioByteUnsafe (io.netty.channel.nio)
processSelectedKey:644, NioEventLoop (io.netty.channel.nio)
processSelectedKeysPlain:544, NioEventLoop (io.netty.channel.nio)
processSelectedKeys:498, NioEventLoop (io.netty.channel.nio)
run:458, NioEventLoop (io.netty.channel.nio)
run:897, SingleThreadEventExecutor$5 (io.netty.util.concurrent)
run:834, Thread (java.lang)
Introduction to the inverted index
What problem does an inverted index solve?
Suppose we have a document, a.txt, containing the text "hello world, i am dinosaur". How would we check whether this document contains the word "world"?
When there are only a few documents, we can:
- 1 Open the file
- 2 Read the file from the beginning and check whether it contains "world"
Now suppose we have not only a.txt but also b.txt and c.txt. How do we tell whether a given word appears in these documents? And if it does, in which documents and on which lines?
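We could keep reading the files one by one from the start. That is acceptable for a handful of documents, but with many documents it takes a very long time to read them all.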
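The file-by-file scan described above can be sketched as follows. This is only an illustration: the documents are held in memory as strings rather than read from disk, the contents of b.txt and c.txt are made up, and the names `NaiveScan`, `Hit`, and `find` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NaiveScan {
    /** One match: the document name and the 1-based line number. */
    static final class Hit {
        final String doc;
        final int line;
        Hit(String doc, int line) { this.doc = doc; this.line = line; }
        @Override public String toString() { return doc + ":" + line; }
    }

    /** Scans every line of every document, so the cost grows with the total amount of text. */
    static List<Hit> find(Map<String, String> docs, String word) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            String[] lines = e.getValue().split("\n");
            for (int i = 0; i < lines.length; i++) {
                // Split on runs of non-letters so that "world," still matches "world".
                for (String token : lines[i].split("[^A-Za-z]+")) {
                    if (token.equals(word)) {
                        hits.add(new Hit(e.getKey(), i + 1));
                        break; // one hit per line is enough
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "hello world, i am dinosaur");
        docs.put("b.txt", "nothing here\nthe world is big"); // made-up sample text
        docs.put("c.txt", "no match");                       // made-up sample text
        System.out.println(find(docs, "world")); // prints [a.txt:1, b.txt:2]
    }
}
```

Every query re-reads all of the text, which is exactly the cost that grows out of hand as the number of documents increases.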
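One of the problems an inverted index solves is how to quickly determine whether a word appears in a set of documents and, if it does, which documents contain it.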
Related example
Baseline inverted index
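An inverted index has two main parts:
- The first part stores two fields for each word $t$:
  - $f_{t}$: the number of documents that contain the word $t$; if three documents contain $t$, then $f_{t} = 3$
  - A pointer to the inverted list for $t$
- The second part, the inverted list, is a list in which each element has two fields:
  - $docid$: the id of a document that contains $t$; think of it as the document's primary key
  - $f_{d}$: the number of times the word $t$ occurs in document $docid$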
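The two-part structure above could look roughly like this in Java. It is a minimal sketch rather than the demo's actual code: the class and method names are hypothetical, the sample texts are made up, terms are split on whitespace only, and documents are assumed to be added in increasing $docid$ order. Here $f_{t}$ is simply the length of the inverted list, and the map entry plays the role of the pointer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class BaselineInvertedIndex {
    /** One element of an inverted list: docid and f_d. */
    static final class Posting {
        final int docId;
        int freq = 1;
        Posting(int docId) { this.docId = docId; }
        @Override public String toString() { return "{" + docId + " " + freq + "}"; }
    }

    // Vocabulary: term -> its inverted list (the reference stands in for the pointer).
    private final Map<String, List<Posting>> lists = new TreeMap<>();

    /** Adds one document; assumes documents arrive in increasing docid order. */
    void add(int docId, String text) {
        for (String term : text.split("\\s+")) {
            if (term.isEmpty()) continue;
            List<Posting> list = lists.computeIfAbsent(term, k -> new ArrayList<>());
            Posting last = list.isEmpty() ? null : list.get(list.size() - 1);
            if (last != null && last.docId == docId) {
                last.freq++;                  // same document again: bump f_d
            } else {
                list.add(new Posting(docId)); // first occurrence in this document
            }
        }
    }

    List<Posting> postings(String term) {
        return lists.getOrDefault(term, Collections.emptyList());
    }

    /** f_t: the number of documents that contain the term. */
    int documentFrequency(String term) { return postings(term).size(); }

    /** Formats one entry as "term f_t|[{docid f_d} ...]". */
    String describe(String term) {
        String body = postings(term).stream()
                .map(Posting::toString)
                .collect(Collectors.joining(" ", "[", "]"));
        return term + " " + documentFrequency(term) + "|" + body;
    }

    public static void main(String[] args) {
        BaselineInvertedIndex idx = new BaselineInvertedIndex();
        idx.add(1, "the old night keeper");                  // made-up sample text
        idx.add(2, "the big old house in the big old gown"); // made-up sample text
        System.out.println(idx.describe("old")); // prints old 2|[{1 1} {2 2}]
    }
}
```

Looking up a word is now a single map lookup followed by a walk over its inverted list, instead of a scan over every document.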
My own demo code is on GitHub; its output is as follows:
keeper 3|[{1 1} {4 1} {5 1}]
In 1|[{2 1}]
house 2|[{2 1} {3 1}]
nignt 2|[{4 1} {5 1}]
did 1|[{4 1}]
dark 1|[{6 1}]
old 4|[{1 1} {2 1} {3 1} {4 1}]
night 3|[{1 1} {5 1} {6 1}]
had 1|[{3 1}]
sleeps 1|[{6 1}]
keep 3|[{1 1} {3 1} {5 1}]
big 2|[{2 1} {3 1}]
keeps 3|[{1 1} {5 1} {6 1}]
the 6|[{1 1} {2 1} {3 1} {4 1} {5 1} {6 1}]
never 1|[{4 1}]
and 1|[{6 1}]
And 1|[{6 1}]
in 5|[{1 1} {2 1} {3 1} {5 1} {6 1}]
The 3|[{1 1} {3 1} {5 1}]
sleep 1|[{4 1}]
Where 1|[{4 1}]
town 2|[{1 1} {3 1}]
gown 1|[{2 1}]
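Steps to build an inverted index
- 1 Read the documents
- 2 Tokenize the text
- 3 Normalize the tokens
- 4 Build the inverted index, including term frequencies and offsets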
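The four steps above can be sketched in a few lines of Java. Several choices here are assumptions made for illustration: a token is a run of ASCII letters, normalization is just lowercasing, an "offset" is the character position where the token starts, and the name `PositionalIndexBuilder` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositionalIndexBuilder {
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // term -> (docid -> character offsets of every occurrence)
    private final Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();

    /** Step 1 (reading) is assumed done: the document text is passed in as a string. */
    void addDocument(int docId, String text) {
        Matcher m = WORD.matcher(text);
        while (m.find()) {                                    // step 2: tokenize
            String term = m.group().toLowerCase(Locale.ROOT); // step 3: normalize
            index.computeIfAbsent(term, k -> new TreeMap<>())
                 .computeIfAbsent(docId, k -> new ArrayList<>())
                 .add(m.start());                             // step 4: record the offset
        }
    }

    /** Offsets of the term in one document, or an empty list if it does not occur. */
    List<Integer> offsets(String term, int docId) {
        Map<Integer, List<Integer>> perDoc = index.get(term);
        if (perDoc == null) return Collections.emptyList();
        List<Integer> offs = perDoc.get(docId);
        return offs == null ? Collections.<Integer>emptyList() : offs;
    }

    /** The term frequency is the number of recorded offsets. */
    int termFrequency(String term, int docId) { return offsets(term, docId).size(); }

    public static void main(String[] args) {
        PositionalIndexBuilder b = new PositionalIndexBuilder();
        b.addDocument(1, "The keep and the keeper"); // made-up sample text
        System.out.println(b.offsets("the", 1));     // prints [0, 13]
        System.out.println(b.termFrequency("the", 1)); // prints 2
    }
}
```

With lowercasing in step 3, "The" and "the" end up in the same entry, whereas the demo output earlier lists them separately.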
Tokenization
https://www.cnblogs.com/forfuture1978/archive/2010/06/06/1752837.html
The Lucene call stack below shows that the main logic all lives in the invert method:
incrementToken:48, FilteringTokenFilter (org.apache.lucene.analysis)
invert:812, DefaultIndexingChain$PerField (org.apache.lucene.index)
processField:442, DefaultIndexingChain (org.apache.lucene.index)
processDocument:406, DefaultIndexingChain (org.apache.lucene.index)
updateDocument:250, DocumentsWriterPerThread (org.apache.lucene.index)
updateDocument:495, DocumentsWriter (org.apache.lucene.index)
updateDocument:1594, IndexWriter (org.apache.lucene.index)
addDocument:1213, IndexWriter (org.apache.lucene.index)
indexDoc:198, IndexFiles (com.dinosaur)
visitFile:155, IndexFiles$1 (com.dinosaur)
visitFile:151, IndexFiles$1 (com.dinosaur)
walkFileTree:2670, Files (java.nio.file)
walkFileTree:2742, Files (java.nio.file)
indexDocs:151, IndexFiles (com.dinosaur)
main:113, IndexFiles (com.dinosaur)
The core of Lucene tokenization is incrementToken, which fetches the next token.
An example
Lucene's standard tokenizer
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class); // final field: one instance, reused for every token
@Override
public final boolean incrementToken() throws IOException {
    ...
    scanner.getText(termAtt); // the scanner copies the text of the current token into termAtt
    ...
}
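To make the pattern concrete without depending on Lucene, here is a simplified plain-Java stand-in. It is not Lucene's code: `TermBuffer` and `WhitespaceTokens` are hypothetical classes that only imitate the idea of a single reusable attribute object that each incrementToken call overwrites, and tokens are split on whitespace only.

```java
import java.util.ArrayList;
import java.util.List;

class IncrementTokenSketch {
    /** Stand-in for a term attribute: one mutable buffer, reused for every token. */
    static final class TermBuffer {
        private final StringBuilder chars = new StringBuilder();
        void set(CharSequence s) { chars.setLength(0); chars.append(s); }
        @Override public String toString() { return chars.toString(); }
    }

    /** Hypothetical whitespace tokenizer that follows the incrementToken() idea. */
    static final class WhitespaceTokens {
        final TermBuffer termAtt = new TermBuffer(); // created once, overwritten per token
        private final String input;
        private int pos = 0;

        WhitespaceTokens(String input) { this.input = input; }

        /** Advances to the next token; returns false once the input is exhausted. */
        boolean incrementToken() {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
            if (pos >= input.length()) return false;
            int start = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) pos++;
            termAtt.set(input.substring(start, pos));
            return true;
        }
    }

    /** Consumer loop: call incrementToken() until it returns false. */
    static List<String> tokens(String text) {
        WhitespaceTokens ts = new WhitespaceTokens(text);
        List<String> out = new ArrayList<>();
        while (ts.incrementToken()) {
            out.add(ts.termAtt.toString()); // copy now: the buffer is reused on the next call
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("hello world i am dinosaur"));
        // prints [hello, world, i, am, dinosaur]
    }
}
```

The consumer never receives a new token object; it reads the shared buffer after each successful call, which is why the value has to be copied before the next call if it needs to be kept.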