Information Retrieval - Query Processing

Subscribe Send me a message home page

Query Processing

Ranked Retrieval

As the results returned by the search engine are ranked, a scoring system must be used. The score should be a function of

Let \(\textrm{score}(q, d) \) denote the score function. Usually, it's a function of \( \textrm{score}(t, d) \), where \(t\) is a term in the query and takes the following form:

$$ \textrm{score}(q, d) = \textrm{quality}(d) + \sum_{t \in d}\textrm{score}(t, d) $$

There are two main types of query processing:

As the names suggest, document-at-a-time query processing processes one document at a time while term-at-a-time query processing uses term as the processing unit.

There exists techniques to optimize performance:

Lightweight Structure

Concordance list

A concordance list is a list of intervals where no internal in this list have another interval from the list nested within it.


We could define the following operators on concordance lists.


where \(\mathcal{G}\) is function defined as

$$ \mathcal{G}(S) = \{ a \,|\, a \in S \; \textrm{and}\; \nexists \; b \in S \; \textrm{such that} \; b \sqsubset a \} $$

Note that there are some details missing here. As we may notice \(\mathcal{G}\) is not well-defined because an empty set always satisfies the condition on the right side of the equation. Here, we are looking for the "max" set that satisfies the condition. To get the max \(\mathcal{G}(S)\) when \(S\) contains nested interval, we could follow the following reduction process:

Suppose \([a, b] \in S\) and \([c, d] \in S\) such as \([a, b] \subset [c, d] \). We could remove \([c, d]\) from \(S\) and repeat the process until \(S\) satisfies the condition. The final \(S\) when the reduction process is completed is denoted by \(\mathcal{G}(S)\).

Helper methods

The helper methods presented in this section are very similar to those used for the posting list operations. We define



----- END -----

Send me a message Subscribe to blog updates

Want some fun stuff?