# Information Retrieval - Query Processing

## Query Processing

### Ranked Retrieval

As the results returned by the search engine are ranked, a scoring system must be used. The score should be a function of

• search query $$q$$
• document $$d$$

Let $$\textrm{score}(q, d)$$ denote the score function. Usually, it's a function of $$\textrm{score}(t, d)$$, where $$t$$ is a term in the query and takes the following form:

$$\textrm{score}(q, d) = \textrm{quality}(d) + \sum_{t \in d}\textrm{score}(t, d)$$

There are two main types of query processing:

• document-at-a-time
• term-at-a-time

As the names suggest, document-at-a-time query processing processes one document at a time while term-at-a-time query processing uses term as the processing unit.

There exists techniques to optimize performance:

• Use heap/priority queue. Most of the time, we don't need to sort all documents but only a subset of candidate documents that has potential to make into the final results.
• Pruning. The idea is that at some point during the processing, we would know that certain documents will never make into the final results and they should be removed. Example: pruning based on MaxScore with Okapi BM25.
• Precomputing score contribution. $$\textrm{score}(t, d)$$ can be computed at indexing time. This saves times when processing queries.
• Impact Ordering. The ordering in the original document collection doesn't have any significance most of the time. A different ordering may provide a better performance.
• Static Index Pruning
• Term-centric
• Document-centric
• Correctness guarantee

### Lightweight Structure

#### Concordance list

A concordance list is a list of intervals where no internal in this list have another interval from the list nested within it.

#### Operators

We could define the following operators on concordance lists.

where $$\mathcal{G}$$ is function defined as

$$\mathcal{G}(S) = \{ a \,|\, a \in S \; \textrm{and}\; \nexists \; b \in S \; \textrm{such that} \; b \sqsubset a \}$$

Note that there are some details missing here. As we may notice $$\mathcal{G}$$ is not well-defined because an empty set always satisfies the condition on the right side of the equation. Here, we are looking for the "max" set that satisfies the condition. To get the max $$\mathcal{G}(S)$$ when $$S$$ contains nested interval, we could follow the following reduction process:

Suppose $$[a, b] \in S$$ and $$[c, d] \in S$$ such as $$[a, b] \subset [c, d]$$. We could remove $$[c, d]$$ from $$S$$ and repeat the process until $$S$$ satisfies the condition. The final $$S$$ when the reduction process is completed is denoted by $$\mathcal{G}(S)$$.

#### Helper methods

The helper methods presented in this section are very similar to those used for the posting list operations. We define

• The method $$\tau(S, k)$$ returns the first interval in the GC-list $$S$$ starting at or after the position $$k$$.
• The method $$\rho(S, k)$$ returns the first interval in $$S$$ ending at or after the position $$k$$.
• The method $$\tau'(S, k)$$ is the converse of $$\tau$$. It returns the last interval in $$S$$ ending at or before the position $$k$$.
• The method $$\rho'(S, k)$$ is the converse of $$\rho$$. It returns the last interval in $$S$$ starting at or before the position $$k$$.

And

----- END -----

Welcome to join reddit self-learning community.

Want some fun stuff?