Note: this is a reading note on the book Information Retrieval: Implementing and Evaluating Search Engines.
In this post, we briefly cover some basic techniques used in information retrieval (IR) systems. We first describe the inverted index, a core data structure used in most IR systems. Then we present three common techniques for retrieving information from a collection of documents. Finally, we describe methods for evaluating the performance of an IR system.
1. Inverted Index
An inverted index is a mapping from terms to their positions of occurrence in a text collection \(C\). It is the central data structure of an information retrieval system.
We further assume the following methods are available:
- first(t): returns the first position at which the term t occurs in the collection;
- last(t): returns the last position at which the term t occurs in the collection;
- next(t, currentPos): returns the position of t's first occurrence after the current position;
- prev(t, currentPos): returns the position of t's last occurrence before the current position.
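To make these four methods concrete, here is a toy in-memory inverted index. It is only a sketch: the class name SimpleInvertedIndex and the sentinels END_OF_COLLECTION / BEFORE_COLLECTION are illustrative names, not notation from the book.

from bisect import bisect_left, bisect_right

END_OF_COLLECTION = float("inf")
BEFORE_COLLECTION = float("-inf")

class SimpleInvertedIndex:
    def __init__(self, tokens):
        # Map each term to the sorted list of positions at which it occurs.
        self.postings = {}
        for pos, term in enumerate(tokens):
            self.postings.setdefault(term, []).append(pos)

    def first(self, t):
        return self.postings[t][0] if t in self.postings else END_OF_COLLECTION

    def last(self, t):
        return self.postings[t][-1] if t in self.postings else BEFORE_COLLECTION

    def next(self, t, currentPos):
        # Smallest position of t strictly greater than currentPos.
        plist = self.postings.get(t, [])
        i = bisect_right(plist, currentPos)
        return plist[i] if i < len(plist) else END_OF_COLLECTION

    def prev(self, t, currentPos):
        # Largest position of t strictly smaller than currentPos.
        plist = self.postings.get(t, [])
        i = bisect_left(plist, currentPos) - 1
        return plist[i] if i >= 0 else BEFORE_COLLECTION

In the rest of this note, next(t, v) and prev(t, v) refer to these methods (or to their real on-disk equivalents).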


Example: Phrase Search
Definition. A phrase is a list of terms enclosed in double quotes.
def nextPhrase(listOfTerms, currentPos):
    """
    Find the next occurrence of the phrase after the current position.
    @listOfTerms: terms in the phrase.
    @currentPos: current position in the text.
    @return: [startPos, endPos] where startPos is the start location of the next
             occurrence of the phrase and endPos is its end location.
    """
    # Walk forward: after the loop, v is a candidate position for the last term.
    v = currentPos
    for term in listOfTerms:
        v = next(term, v)
        if v == END_OF_COLLECTION:
            return [END_OF_COLLECTION, END_OF_COLLECTION]
    # Walk backward from v through all terms except the last one.
    u = v
    n = len(listOfTerms)
    for k in range(n - 2, -1, -1):
        u = prev(listOfTerms[k], u)
    if v - u == n - 1:
        # The n terms are adjacent, so [u, v] is an occurrence of the phrase.
        return [u, v]
    else:
        # Otherwise, restart the search just after the candidate start u.
        return nextPhrase(listOfTerms, u)
As we can see, the algorithm is straightforward once the inverted index and the four methods described earlier are available.
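For example, wiring nextPhrase to the toy index sketched above (the wiring below is illustrative, not from the book):

tokens = "to be or not to be".split()
index = SimpleInvertedIndex(tokens)
next, prev = index.next, index.prev   # bind the helpers nextPhrase expects (shadows the builtin next)
print(nextPhrase(["to", "be"], BEFORE_COLLECTION))   # -> [0, 1]
print(nextPhrase(["to", "be"], 0))                   # -> [4, 5]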
2. Document-Oriented Index
Sometimes it is helpful to treat the document as the unit of retrieval. Here "document" is a general concept: it can be a text file, a chapter of a book, or just a few lines of a paragraph. A document-oriented index is a schema-dependent index in which the location of a term is characterized by two numbers: (1) a document id and (2) the offset of the term within that document.
We can think of a document-oriented index as a kind of "batch processing" by document, and statistics about terms and documents provide valuable information, such as
- \(N_t\): the number of documents containing \(t\).
- \(f_{t,d}\): the number of occurrences of \(t\) within the document \(d\).
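As a sketch, both statistics can be read directly off a simple (docid, offset) postings structure; the class and method names below are illustrative choices, not the book's.

from collections import defaultdict

class DocOrientedIndex:
    def __init__(self, docs):
        # docs: a list of token lists; the list index serves as the document id.
        self.postings = defaultdict(list)   # term -> [(docid, offset), ...]
        self.num_docs = len(docs)
        for docid, tokens in enumerate(docs):
            for offset, term in enumerate(tokens):
                self.postings[term].append((docid, offset))

    def N_t(self, t):
        # Number of distinct documents containing t.
        return len({docid for docid, _ in self.postings[t]})

    def f_td(self, t, d):
        # Number of occurrences of t within document d (a linear scan is fine
        # for a sketch; a real index would store this count directly).
        return sum(1 for docid, _ in self.postings[t] if docid == d)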
3. Retrieval and Ranking
There are three common approaches:
- TF-IDF and cosine similarity
- Proximity ranking
- Boolean filters
The first approach, TF-IDF with cosine similarity, represents documents and queries as vectors in a high-dimensional space. The result of a query is a ranked list of documents: the document most similar to the query is ranked highest. This is one of the classic retrieval methods, although it has largely been overtaken by modern probabilistic and machine-learning-based approaches.
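As an illustration, here is a minimal sketch of one common TF-IDF variant, weighting each term by \(f_{t,d} \cdot \log(N / N_t)\) (where \(N\) is the total number of documents) and ranking by cosine similarity. The book discusses several weighting schemes, so treat this particular formula as an illustrative assumption rather than the canonical one.

import math
from collections import Counter

def tfidf_vector(tokens, df, num_docs):
    """Sparse TF-IDF vector (term -> weight) for one document or query."""
    tf = Counter(tokens)
    return {t: f * math.log(num_docs / df[t])
            for t, f in tf.items() if df.get(t, 0) > 0}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query_tokens, docs):
    """Rank documents (given as lists of tokens) by decreasing similarity to the query."""
    num_docs = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))   # df[t] corresponds to N_t
    doc_vectors = [tfidf_vector(tokens, df, num_docs) for tokens in docs]
    query_vector = tfidf_vector(query_tokens, df, num_docs)
    scores = sorted(((cosine(query_vector, dv), docid)
                     for docid, dv in enumerate(doc_vectors)), reverse=True)
    return [docid for _, docid in scores]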
The second approach is proximity ranking. The idea is simple: if the terms of the query appear close together in a document, then that document may be relevant to the query. Similar to the phrase search algorithm, we look for covers in documents. A cover is a range [u, v] such that (1) [u, v] lies inside a single document and (2) [u, v] contains all the terms in the query.
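A cover-finding routine can be sketched in the same style as nextPhrase, assuming the same next/prev helpers plus a docid(position) helper that maps a position to the document containing it (that helper is an assumption of this sketch, not something defined above).

def nextCover(queryTerms, position):
    # v is the earliest position by which every query term has occurred at
    # least once strictly after `position`.
    v = max(next(t, position) for t in queryTerms)
    if v == END_OF_COLLECTION:
        return [END_OF_COLLECTION, END_OF_COLLECTION]
    # u is the latest possible start: the smallest of the last occurrences of
    # each query term at or before v.
    u = min(prev(t, v + 1) for t in queryTerms)
    if docid(u) == docid(v):
        # [u, v] lies within one document and contains every query term.
        return [u, v]
    else:
        # The candidate straddles a document boundary; keep searching after u.
        return nextCover(queryTerms, u)

Documents can then be scored from their covers, with shorter covers contributing more than longer ones.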
The third approach is the Boolean filter. Strictly speaking, this is not a ranking method; as the name indicates, it is a filter. Boolean operators AND, OR and NOT are used in the query. The implementation is, once again, similar to the phrase search algorithm: this time we find document ranges [doc_u, doc_v] such that the documents within the range jointly satisfy the query. If the range has size 1, it reduces to a single document that satisfies the query on its own.
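The range-based algorithm itself is somewhat longer, so here is a much simpler set-based illustration of Boolean filtering instead, reusing the hypothetical DocOrientedIndex sketch from Section 2: each term is resolved to the set of documents containing it, and AND / OR / NOT become set operations.

def docs_containing(index, term):
    """Set of docids containing term, using the DocOrientedIndex sketch above."""
    return {docid for docid, _ in index.postings[term]}

def boolean_and(index, terms):
    sets = [docs_containing(index, t) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, terms):
    return set().union(*(docs_containing(index, t) for t in terms))

def boolean_not(index, terms):
    return set(range(index.num_docs)) - boolean_or(index, terms)

# For example, the documents matching ("information" AND "retrieval"):
#   boolean_and(index, ["information", "retrieval"])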
4. Evaluation
Recall and precision are two widely used effectiveness measures. Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved. A trade-off frequently exists between them: increasing recall tends to decrease precision. Efficiency, on the other hand, is usually measured by response time.
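For instance, given a ranked result list and the set of documents judged relevant, precision and recall at a cutoff k can be computed as follows (the helper name is illustrative).

def precision_recall_at_k(ranked_docids, relevant, k):
    """Precision and recall considering only the top-k retrieved documents."""
    retrieved = ranked_docids[:k]
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: precision_recall_at_k([3, 7, 1, 9], {1, 2, 3}, 4) -> (0.5, 0.666...)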
