In this post, we will present a brief introduction to Lucene core components and document indexing.
These are reading notes on Lucene in Action, Second Edition, so some of the notions may be out of date, but the general ideas should still be valid.
Lucene is a high-performance, scalable information retrieval (IR) library. It concerns itself with text indexing and searching.
The figure below presents the overall process of a search application:
As we can see in the figure, there are two directions of flow:
- Flow from users to the index
- Flow from raw content to the index
The first flow is the searching part of a search application and the second flow is the indexing part. In this post, we will focus on the indexing part of a search application.
Indexing is a process that transforms the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching.
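To make the "cross-reference lookup" idea concrete, here is a minimal sketch of an inverted index in plain Java. It is a toy under stated assumptions: the class name `InvertedIndexSketch`, its methods, and the naive tokenization are made up for illustration and are not Lucene's API or storage format.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: term -> sorted set of document ids.
// Illustrative only; this is not how Lucene stores or analyzes data.
class InvertedIndexSketch {

    // Builds the term -> doc ids mapping. A document's id is its position in the list.
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Naive analysis: lowercase, then split on anything that is not a-z or 0-9.
            String text = docs.get(docId).toLowerCase(Locale.ROOT);
            for (String term : text.split("[^a-z0-9]+")) {
                if (term.isEmpty()) {
                    continue; // split() can yield a leading empty string
                }
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // Returns the ids of the documents containing the term (empty set if none).
    static SortedSet<Integer> search(Map<String, SortedSet<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Lucene is a search library",
                "Indexing makes search fast");
        Map<String, SortedSet<Integer>> index = build(docs);
        System.out.println("search -> " + search(index, "search"));
        System.out.println("lucene -> " + search(index, "lucene"));
    }
}
```

The point of the structure is that a query term is resolved with a single map lookup instead of a scan over every document, which is what makes searching fast once the index has been built.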
The indexing workflow is illustrated in the figure below:
To build an index, we first need to collect our data, which is the raw content of the index. Then we need to transform the raw content into documents. A document is Lucene's atomic unit of data used for indexing and searching. It's a container that holds one or more fields, which in turn contain the "real" content. Creating documents can be challenging because we need to deal with many different formats and need to figure out a unified representation of all the data. Note that this step is application-specific and is not Lucene's responsibility.
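The "unified representation" step can be sketched as follows. This is only an illustration in plain Java: `SimpleDocument`, the two converters, the field names and the metadata keys are all hypothetical, and the toy keeps one value per field name, which is a simplification compared with a real Lucene document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of "a document is a container of named fields".
// Every name here is hypothetical; none of these are Lucene classes.
class DocumentSketch {

    static final class SimpleDocument {
        private final Map<String, String> fields = new LinkedHashMap<>();

        SimpleDocument add(String name, String value) {
            fields.put(name, value);
            return this;
        }

        String get(String name) {
            return fields.get(name);
        }

        @Override
        public String toString() {
            return fields.toString();
        }
    }

    // Raw format 1: a well-formed CSV line "title,author,body" (assumed layout).
    static SimpleDocument fromCsv(String line) {
        String[] parts = line.split(",", 3);
        return new SimpleDocument()
                .add("title", parts[0].trim())
                .add("author", parts[1].trim())
                .add("body", parts[2].trim());
    }

    // Raw format 2: key-value metadata, e.g. extracted from a web page (assumed keys).
    static SimpleDocument fromMetadata(Map<String, String> meta) {
        return new SimpleDocument()
                .add("title", meta.getOrDefault("page-title", ""))
                .add("author", meta.getOrDefault("page-author", ""))
                .add("body", meta.getOrDefault("page-content", ""));
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("page-title", "Notes on search");
        meta.put("page-content", "Searching reads the index");
        System.out.println(fromCsv("Intro to indexing, alice, Indexing builds a lookup structure"));
        System.out.println(fromMetadata(meta));
    }
}
```

Both converters produce documents with the same field names, so the rest of the pipeline does not need to know which raw format a document came from.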
Not all documents are equal. Some documents are more important than others. The importance of documents is represented by the boosting factor. There are two types of boosting:
- dynamic: The boosting factor is set during query time.
- static: The boosting factor is set during index time.
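The difference between the two kinds of boosting can be sketched with a toy scorer. This is not Lucene's scoring formula: `BoostSketch`, its methods, and the simple multiplication are assumptions chosen only to show where each factor comes from.

```java
import java.util.HashMap;
import java.util.Map;

// Toy scorer contrasting static (index-time) and dynamic (query-time) boosts.
// Hypothetical names and formula; not Lucene's actual scoring.
class BoostSketch {

    // Static boosts are recorded once, when a document is indexed.
    private final Map<String, Float> indexTimeBoost = new HashMap<>();

    void index(String docId, float boost) {
        indexTimeBoost.put(docId, boost);
    }

    // baseScore stands in for whatever relevance the engine computed.
    // queryTimeBoost is supplied with each query and can differ between queries.
    float score(String docId, float baseScore, float queryTimeBoost) {
        float staticBoost = indexTimeBoost.getOrDefault(docId, 1.0f);
        return baseScore * staticBoost * queryTimeBoost;
    }

    public static void main(String[] args) {
        BoostSketch scorer = new BoostSketch();
        scorer.index("important-doc", 2.0f); // static: fixed at index time
        scorer.index("ordinary-doc", 1.0f);

        // Same base relevance, same query boost: the static boost decides the order.
        System.out.println(scorer.score("important-doc", 1.5f, 1.0f));
        System.out.println(scorer.score("ordinary-doc", 1.5f, 1.0f));

        // Dynamic: a different query can apply a different boost without re-indexing.
        System.out.println(scorer.score("ordinary-doc", 1.5f, 3.0f));
    }
}
```

In this sketch, changing a static boost means indexing the document again, whereas a dynamic boost can change from one query to the next. That trade-off is the main reason to choose one over the other.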
The transformation from documents to tokens is done by analyzers in Lucene. We will cover this in later posts.
Lucene provides the following core classes to support indexing and searching:
Click here or on the image to get a PDF version of the mind map.