Note: this is a reading note of the book Information Retrieval: Implementing and Evaluating Search Engines.
In order to build an inverted index, we first need to transform documents into tokens. This process is called tokenization. There are three traditional techniques used in tokenization: stemming, stopping, and N-grams.
Stemming transforms a word into its root form. For example, both "runs" and "running" are reduced to "run". Stopping ignores stopwords such as "the" and "I" in the documents. Stopwords carry little information most of the time, so it is safe to drop them in both document processing and query processing. The N-grams technique splits a word into overlapping tokens. For example, with \(n=5\), the word "orienteering" is split into the following 5-grams:
`_orie orien rient iente entee nteer teeri eerin ering ring_`
An obvious drawback of the N-grams approach is the increased processing time and storage space.
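The three techniques above can be sketched in a few lines of Python. This is only an illustration: the suffix rules and the stopword list below are toy assumptions of mine, not the ones a real engine (or the book) would use, and a production system would use a proper stemmer such as Porter's.

```python
# Illustrative stopword list (assumed; real lists are much longer).
STOPWORDS = {"the", "i", "a", "an", "of", "and", "to", "in"}

def stem(word):
    """Naive suffix-stripping stemmer (a toy stand-in for e.g. Porter's)."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def ngrams(word, n=5):
    """Split a word into overlapping character n-grams, marking the
    word boundaries with '_' as in the example above."""
    padded = f"_{word}_"
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]

def tokenize(text):
    """Minimal pipeline: lowercase, apply stopping, then stemming."""
    tokens = [w.lower() for w in text.split()]
    tokens = [w for w in tokens if w not in STOPWORDS]   # stopping
    return [stem(w) for w in tokens]                     # stemming

print(tokenize("I like running in the park"))  # ['like', 'run', 'park']
print(ngrams("orienteering"))  # reproduces the ten 5-grams listed above
```

Note that `ngrams("orienteering")` yields exactly the ten 5-grams shown earlier, including the boundary-padded `_orie` and `ring_`.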
The tokenization process has many challenges:
- Special meaning
Most of the time, the difficulty comes from the fact that there is no one-size-fits-all rule and there are always exceptions.