Information Retrieval - Probabilistic Retrieval

Subscribe Send me a message home page tags


#Information Retrieval  #IR  #probabilistic retrieval 

Table of Contents

Introduction

The general idea of probabilistic retrieval is that given a query, we want to determine the probability of relevance for each document in the collection and sort the documents accordingly. The calculated probability is basically used as a score for document ranking.

The probability we want to calculate is

$$ p(r|D,Q) $$

where \(D\) is the document, \(Q\) is the query and \(r\) is a binary variable. If \(r = 1\), it means the document \(D\) is relevant to the query \(Q\), otherwise, \(D\) is irrelevant to the query \(Q\).

Definition: Log-odds:

$$ logit(p) = log( \frac{p}{1-p} ) $$
log_odds.png

Log-odds and probability are rank-equivalent - ranking by one produces the same ordering as ranking by the other.

Now, if we apply Bayes' rule, we have

$$ \begin{eqnarray} log \frac{p(r|D,Q)} {1 - p(r|D,Q)} & = & log \frac{p(D,Q|r)p(r)} {p(D,Q|\bar{r})p(\bar{r})} \\ & = & log \frac{p(D|Q, r)}{p(D|Q, \bar{r})} + log \frac{p(r|Q)} {p(\bar{r}|Q)} \end{eqnarray} $$

The second term in the last line does not depend on the document \(D\) and we can consider it an indicator of query difficulty. It follows that ranking by log-odds of \(p(r|D,Q)\) is equivalent to ranking by

$$ log \frac{p(D|Q,r)}{p(D|Q,\bar{r})} $$

In the following sections we will talk about how to estimate this value.

The Binary Independence Model

In the Binary Independence Model, both documents and queries are presented by binary vectors of length \(|V|\) where \(V\) is the vocabulary.

More specifically

$$ D = (D_1, D_2, ..., D_{|V|}) \\ Q = (Q_1, Q_2, ..., Q_{|V|}) $$

where \(D_i = 1\) if term \(i\) appears in the document \(D\), otherwise \(D_i = 0\). The same applies to \(Q_i\).

The Binary Independence Model makes two strong assumptions

The Assumption T is a classic type of assumptions that allows us to decompose a formula into multiple components. In mathematical terms, Assumption T can be expressed by

$$ p(D|Q,r) = \prod_{j = 1}^{|V|} p(D_j|Q, r) \\ p(D|Q,\bar{r}) = \prod_{j = 1}^{|V|} p(D_j|Q, \bar{r}) \\ $$

Assumption Q basically says given a term \(i\), if \(Q_i = 0\) then \(p(D_i|Q, r) = p(D_i|Q,\bar{r})\). Here is an example, suppose we want to search for documents about "Mathematics" and the search engine returns two sets of documents (1) math related documents and (2) non-math related documents. The Assumption Q implies that the probability of having a term , which does not appear in the query such as "theory", "definition", in the documents is the same for the two different sets. This is obviously not true in real world because math related documents are more likely to contain terms like "theory" or "definition".

Combining both assumptions together, we have

$$ \log \frac {p(D|Q,r)} {p(D|Q, \bar{r})} = \sum_{t\in{}q} log \frac {p(D_t = d_t|r)} {p(D_t = d_t|\bar{r})} $$

All we care is ranking so we can transform the scoring formula a little bit as long as the ordering is preserved. The trick here is to subtract a value which does not depend on documents

$$ \sum_{t\in{}q} log \frac {p(D_t = 0 | r)} {p(D_t = 0 | \bar{r})} $$

After some arrangement, we have

$$ \log \frac {p(D|Q,r)} {p(D|Q, \bar{r})} \approx \sum_{t \in (q \cap d) } log \frac {p(D_t = 1 | r)p(D_t = 0|\bar{r})} {p(D_t = 1 | \bar{r})p(D_t = 0|r)} \tag{1} $$

The Robertson/Sparck Jones Weighting Formula

Let's define the following terms

$$ p_t = p(D_t = 1 | r) \\ \bar{p_t} = p(D_t = 1 | \bar{r}) \\ w_t = log \frac {p_t(1 - \bar{p_t})} {\bar{p_t}(1 - p_t)} $$

The scoring formula in Eq (1) can be expressed simply as

$$ \sum_{t \in (q \cap d)} w_t $$

If we set \( p_t = \frac {1} { 1 + (N - N_t) / N } \), then the formula becomes

$$ w_t = log \frac {N} {N_t} $$

which is the standard IDF formula.

Note that the weight \(w_t\) here is a weight of the term \(t\) in the collections and it doesn't depend on any specific document.

Term Frequency and BM25

We can do better than the Binary Independence Model. Instead of using a binary representation of documents and queries, we could use term frequency vectors, which provide a more accurate picture of documents.

In 1994, Robertson and Walker suggest a simple approximation to the two-Poisson term weight model that is based on a notion called "eliteness". A document is said to be elite in term \(t\) when it is somehow "about" the topic associated with the term.

The proposed the formula is

$$ \sum_{t\in{}q} \frac {f_{t,d}(k_1 + 1)} {k_1 + f_{t,d}} \cdot w_t $$

We can incorporate the query terms as well, which gives

$$ \sum_{t\in{}q} \frac {q_t(k_3 + 1)} {k_3 + q_t} \cdot \frac {f_{t,d}(k_1 + 1)} {k_1 + f_{t,d}} \cdot w_t $$

If \(q_t \to \infty \), then \( \frac {q_t(k_3 + 1)} {k_3 + q_t} \to k_3 + 1 \). So \(k_3\) represents the maximum contribution of a term being in the query. We could assume if a term appear many times in the query then it is important. It's common to set \(k_3\) to a very large value and if we set \(k_3\) to \(\infty\), the formula can reduce to

$$ \sum_{t\in{}q} q_t \cdot \frac {f_{t,d}(k_1 + 1)} {k_1 + f_{t,d}} \cdot w_t \tag{2} $$

BM25

What is missing in the previous models is the consideration on document lengths. If we incorporate documents lengths, we will have the BM25 Model

$$ \sum_{t\in{}q} q_t \cdot \frac {f_{t,d}(k_1 + 1)} {k_1 ((1-b) + b(l_d/l_{avg})) + f_{t,d}} \cdot w_t \tag{3} $$

Field Weights and BM25F

There is still one element missing in Eq (3). We didn't take the document structure into account. For example, there are many tags in a html file and some of the tags are more important than others. A term inside <title> tag should carry more weight than the same term inside a <p> tag.

Here we present the BM25F model which incorporates document structure and field weights.

It defines the following terms

$$ \begin{eqnarray} f'_{t,d,s} & = & \frac {f_{t,d,s}} {(1 - b_s) + b_s ( l_{d,s} / l_s )} \\ \;\\ f'_{t,d} & = & \sum_s v_s \cdot f'_{t,d,s} \end{eqnarray} $$

As we can see here, \(v_s\) is used to adjust the relative importance of different fields \(s\).

The score formula in BM25F is

$$ \sum_{t\in{}q} q_t \cdot \frac {f'_{t,d}(k_1 + 1)} {k_1 + f'_{t,d}} \cdot w_t $$

----- END -----

If you have questions about this post, you could find me on Discord.
Send me a message Subscribe to blog updates

Want some fun stuff?

/static/shopping_demo.png