Lucene - Analysis Process

In this post, we provide a brief introduction to the analysis process in Lucene.

Outline

• Components
• Analyzer
• Token and Token Stream
• Synonym Expansion
• Complication
• Multivalued Field
• Different analyzers for different fields

Components

Analyzer

Analyzer is one of the building blocks in Lucene. It performs analysis on the raw input, and we must specify an analyzer when creating an index writer.

The Analyzer class is the abstract base class. It has the following method:

public TokenStream tokenStream(String fieldName, Reader reader)

Note that the analyzer's tokenStream() method can be called directly, but we normally do this only when debugging; during indexing, the index writer calls it for us.

Token and Token Stream

A stream of tokens is the fundamental output of the analysis process. It is represented by the TokenStream class.

The overall workflow of token generation is illustrated by the figure below:

A token carries with it a text value (the word itself) as well as some metadata:

• The start and end character offsets in the original text.
• A token type. This is used in the token filtering process.
• A position increment. The default value is 1, which means tokens are successive.
• Token flags.

(Note that token flag and token type are not stored in the index.)
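To make the metadata above concrete, here is a minimal plain-Java sketch of a tokenizer that records it for each token. It is a toy model, not Lucene's own classes: the names ToyTokenizer and Token are hypothetical, and the tokenization rule (split on whitespace, lowercase, type "word") is an assumption chosen for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of token metadata; NOT Lucene's API.
final class ToyTokenizer {

    static final class Token {
        final String term;           // the word itself
        final int startOffset;       // inclusive character offset in the original text
        final int endOffset;         // exclusive character offset in the original text
        final String type;           // label that a later filter could inspect
        final int positionIncrement; // 1 means "directly after the previous token"

        Token(String term, int startOffset, int endOffset, String type, int positionIncrement) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.type = type;
            this.positionIncrement = positionIncrement;
        }
    }

    // Splits on whitespace and lowercases each word (an assumed, simplified rule).
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip the whitespace between words.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one word.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                tokens.add(new Token(term, start, i, "word", 1));
            }
        }
        return tokens;
    }
}
```

Calling ToyTokenizer.tokenize("The quick  fox") yields three tokens. The offsets still point into the original string (including the double space), while every position increment stays at the default of 1.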

One thing worth noting is how a token is represented in Lucene. Lucene doesn't create a Token object that holds all attributes. Instead, it adopts an attribute-based API. Here is the list of Lucene's built-in token attributes:

• TermAttribute (replaced by CharTermAttribute in Lucene 3.1 and later)
• PositionIncrementAttribute
• OffsetAttribute
• TypeAttribute
• FlagsAttribute

Note that only built-in attributes are used during indexing.
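The attribute-based idea can be sketched in plain Java as follows. This is a toy model under stated assumptions, not Lucene's implementation: ToyAttributeStream, TermAttr and PositionIncrementAttr are hypothetical names, and the stream simply splits its input on whitespace. The point it illustrates is that the consumer obtains each attribute object once, and the stream overwrites those same objects every time it advances, instead of allocating a new token per word.

```java
// Toy model of an attribute-based token stream; NOT Lucene's classes.
final class ToyAttributeStream {

    // Each attribute holds one facet of the *current* token.
    static final class TermAttr { String term = ""; }
    static final class PositionIncrementAttr { int positionIncrement = 1; }

    private final TermAttr termAttr = new TermAttr();
    private final PositionIncrementAttr posIncAttr = new PositionIncrementAttr();
    private final String[] words;
    private int next = 0;

    ToyAttributeStream(String text) {
        String trimmed = text.trim();
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Look up an attribute by its class; the same instance is returned every time.
    <T> T getAttribute(Class<T> type) {
        if (type == TermAttr.class) {
            return type.cast(termAttr);
        }
        if (type == PositionIncrementAttr.class) {
            return type.cast(posIncAttr);
        }
        throw new IllegalArgumentException("unknown attribute: " + type.getName());
    }

    // Advance to the next token by overwriting the shared attribute objects.
    boolean incrementToken() {
        if (next >= words.length) {
            return false;
        }
        termAttr.term = words[next++];
        posIncAttr.positionIncrement = 1;
        return true;
    }
}
```

A consumer would call getAttribute(ToyAttributeStream.TermAttr.class) once, then loop on incrementToken() and read the term field after each call.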

Synonym Expansion

Synonym expansion is an interesting example. It helps us develop insight into how Lucene works internally. There are two important concepts involved:

• Attribute source state
• Position and position increment.

When we add something to a position (i.e. the blue box), we need to specify the state of the system. This state is maintained by Lucene. Intuitively, there are a lot of things we need to track. For example, we need to track the start and end offsets of the token in the original text. We usually don't need to deal with state ourselves because the built-in methods in Lucene take care of it.

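However, this is not the case in synonym expansion: the filter has to save the state of the original token and restore it each time it emits a synonym (AttributeSource provides captureState() and restoreState() for this). The reason is that after the analysis process, the synonyms of a token occupy the same position as the token itself. This is expressed by setting the position increment of each synonym to zero.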
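The position arithmetic can be illustrated with a small plain-Java sketch. It is a toy model rather than a Lucene TokenFilter: ToySynonymExpansion and Tok are hypothetical names, the synonym map is passed in by the caller, and "restoring the state" is modeled simply by copying the original token's offsets onto each synonym.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy model of synonym injection; NOT a Lucene TokenFilter.
final class ToySynonymExpansion {

    static final class Tok {
        final String term;
        final int positionIncrement;
        final int startOffset;
        final int endOffset;

        Tok(String term, int positionIncrement, int startOffset, int endOffset) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Emits each original token, followed by its synonyms at the same position.
    static List<Tok> expand(List<Tok> input, Map<String, List<String>> synonyms) {
        List<Tok> out = new ArrayList<>();
        for (Tok original : input) {
            out.add(original);
            List<String> alternatives =
                    synonyms.getOrDefault(original.term, Collections.<String>emptyList());
            for (String synonym : alternatives) {
                // Reuse the original token's offsets (the saved "state") and
                // use increment 0 so the synonym does not advance the position.
                out.add(new Tok(synonym, 0, original.startOffset, original.endOffset));
            }
        }
        return out;
    }

    // Turns position increments into absolute positions, starting from -1.
    static List<Integer> positions(List<Tok> tokens) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (Tok token : tokens) {
            position += token.positionIncrement;
            result.add(position);
        }
        return result;
    }
}
```

For the input tokens "quick" and "fox" with the assumed synonyms "fast" and "speedy" for "quick", the expanded stream is quick, fast, speedy, fox. The first three tokens share one position and "fox" follows at the next one, so a phrase query for "fast fox" would line up just like "quick fox".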

Complication

Multivalued Field

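A major source of complication is multivalued fields. Having multivalued fields usually requires us to implement our own analyzer. The issue is that the values of a multivalued field are actually concatenated, so we need to deal with offsets and the position increment gap. The default behavior is to concatenate multiple values seamlessly, as if they were one continuous text. Here is an example provided in the book: suppose we have two values for the text field: (1) "It's time to pay income tax" and (2) "return library books on time". If we search for the phrase "tax return", we don't want Lucene to return any results. With the default behavior, however, "tax" and "return" end up at adjacent positions and the phrase matches. Overriding the analyzer's getPositionIncrementGap() method to return a sufficiently large gap prevents this.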
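The effect of the position increment gap on the "tax return" example can be sketched in plain Java. This is a simplified model, not Lucene's indexing code: ToyMultiValuedField is a hypothetical name, tokens are produced by lowercasing and splitting on whitespace, and the gap is simply added to the running position before each value after the first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of positions in a multivalued field; NOT Lucene's indexing code.
final class ToyMultiValuedField {

    // Maps each term to the positions at which it occurs across all values.
    static Map<String, List<Integer>> index(List<String> values, int positionIncrementGap) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int position = -1;
        for (int v = 0; v < values.size(); v++) {
            if (v > 0) {
                // Open a gap between the end of one value and the start of the next.
                position += positionIncrementGap;
            }
            String normalized = values.get(v).toLowerCase(Locale.ROOT).trim();
            for (String word : normalized.split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                position++;
                positions.computeIfAbsent(word, k -> new ArrayList<>()).add(position);
            }
        }
        return positions;
    }

    // A two-word phrase matches when "second" occurs directly after "first".
    static boolean phraseMatches(Map<String, List<Integer>> positions, String first, String second) {
        List<Integer> firstPositions = positions.getOrDefault(first, Collections.<Integer>emptyList());
        List<Integer> secondPositions = positions.getOrDefault(second, Collections.<Integer>emptyList());
        for (int p : firstPositions) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }
}
```

With a gap of 0, the two example values are indexed back to back and phraseMatches(index, "tax", "return") is true. With any positive gap, for example 100, "return" no longer directly follows "tax", while a phrase inside a single value such as "income tax" still matches.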

Different analyzers for different fields

By default, we can attach different analyzers to different documents. How about using different analyzers for different fields? We cannot do this directly. However, the tokenStream method actually takes the fieldName as one of its arguments, so we can choose which analysis to apply, based on the field name, when we generate the token stream.

Lucene also provides a helper class for exactly this purpose, PerFieldAnalyzerWrapper, which maps field names to analyzers and falls back to a default analyzer for all other fields.
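The dispatch-by-field-name idea can be sketched in plain Java as follows. This is a toy model, not the real PerFieldAnalyzerWrapper: ToyPerFieldAnalyzer is a hypothetical class, an "analyzer" is reduced to a function from text to a list of terms, and the two sample analyzers (keyword and lowercaseWhitespace) are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Toy model of per-field analysis; NOT Lucene's PerFieldAnalyzerWrapper.
final class ToyPerFieldAnalyzer {

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    ToyPerFieldAnalyzer(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Registers a dedicated analyzer for one field.
    void addAnalyzer(String fieldName, Function<String, List<String>> analyzer) {
        perField.put(fieldName, analyzer);
    }

    // The field name decides which analyzer produces the tokens.
    List<String> tokenStream(String fieldName, String text) {
        return perField.getOrDefault(fieldName, defaultAnalyzer).apply(text);
    }

    // Sample analyzer: keeps the whole value as a single term.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Sample analyzer: lowercases and splits on whitespace.
    static List<String> lowercaseWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.toLowerCase(Locale.ROOT).split("\\s+"));
    }
}
```

A wrapper created with lowercaseWhitespace as the default and keyword registered for a field named "id" would keep "AB 12" intact for the "id" field, while splitting and lowercasing the text of any other field.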
