Information Retrieval - Token and Tokenization



Note: this is a reading note of the book Information Retrieval: Implementing and Evaluating Search Engines.

Common Techniques

In order to build an inverted index, we first need to transform each document into a sequence of tokens. This process is called tokenization.
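As a rough illustration of this step, here is a minimal tokenization sketch (the regex-based splitting and the function name are my own illustrative choices, not the book's implementation); it turns a piece of text into lowercased word tokens that could then be fed into an inverted index:

```python
import re

def tokenize(text):
    """Split text into lowercased word tokens.

    A minimal sketch: a production tokenizer would also deal with
    punctuation, hyphenation, numbers, and non-Latin scripts.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Running is fun; I run every day."))
# ['running', 'is', 'fun', 'i', 'run', 'every', 'day']
```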

There are three traditional techniques used in tokenization: stemming, stopping, and N-grams (a combined sketch of all three follows the 5-gram example below). Stemming transforms a word into its root form; for example, both "runs" and "running" are converted to "run". Stopping removes stopwords such as "the" and "I" from the documents. Stopwords carry little information most of the time, so it is usually safe to ignore them in both document processing and query processing. The N-grams technique splits a word into overlapping character sequences. For example, with \(n=5\), the word "orienteering" is split into the following 5-grams:

_orie orien rient iente entee nteer teeri eerin ering ring_
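To make the three techniques concrete, here is a hedged sketch (the function names, the tiny stopword list, and the deliberately naive suffix stripping are illustrative assumptions, not the book's algorithms; a real system would use an algorithmic stemmer such as Porter's):

```python
STOPWORDS = {"the", "i", "a", "an", "is", "of"}  # tiny illustrative list

def remove_stopwords(tokens):
    """Stopping: drop tokens that carry little information."""
    return [t for t in tokens if t not in STOPWORDS]

def naive_stem(token):
    """Stemming (very naive): strip a few common suffixes.

    Only for illustration; real systems use an algorithmic
    stemmer such as the Porter stemmer.
    """
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def char_ngrams(word, n=5):
    """N-grams: pad the word with '_' boundary markers and emit
    every overlapping window of n characters."""
    padded = f"_{word}_"
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]

print(remove_stopwords(["the", "runs", "i", "running"]))  # ['runs', 'running']
print(naive_stem("runs"), naive_stem("running"))          # run run
print(char_ngrams("orienteering", n=5))
# ['_orie', 'orien', 'rient', 'iente', 'entee',
#  'nteer', 'teeri', 'eerin', 'ering', 'ring_']
```

Running `char_ngrams("orienteering", n=5)` reproduces exactly the ten 5-grams listed above, with `_` marking the word boundaries.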

One obvious drawback of the N-grams approach is the increased processing time and storage space.

Challenges

The tokenization process has many challenges. Most of the time, the difficulty comes from the fact that there is no one-size-fits-all rule: there are always exceptions.
