Table of Contents
- Introduction
- The Binary Independence Model
- The Robertson/Sparck Jones Weighting Formula
- Term Frequency and BM25
- Field Weights and BM25F
Introduction
The general idea of probabilistic retrieval is that, given a query, we want to estimate the probability of relevance of each document in the collection and sort the documents accordingly: the estimated probability serves as the document's ranking score.
The probability we want to calculate is
\[
p(r = 1 | D, Q)
\]
where \(D\) is the document, \(Q\) is the query and \(r\) is a binary variable. If \(r = 1\), the document \(D\) is relevant to the query \(Q\); otherwise, \(D\) is irrelevant to the query \(Q\).
Definition: Log-odds:
\[
\operatorname{logit}(p) = \log \frac{p}{1 - p}
\]
Log-odds and probability are rank-equivalent: since the logit is a strictly increasing function of \(p\), ranking by one produces the same ordering as ranking by the other.
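To see this concretely, here is a tiny Python sketch (with made-up probabilities) showing that sorting by probability and sorting by log-odds give the same ordering:

```python
import math

def log_odds(p):
    """Log-odds (logit) of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

# Hypothetical relevance probabilities for three documents.
probs = {"doc1": 0.9, "doc2": 0.3, "doc3": 0.6}

by_prob = sorted(probs, key=lambda d: probs[d], reverse=True)
by_log_odds = sorted(probs, key=lambda d: log_odds(probs[d]), reverse=True)

assert by_prob == by_log_odds  # both give ['doc1', 'doc3', 'doc2']
```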
Now, if we apply Bayes' rule to the log-odds of relevance (writing \(\bar{r}\) for \(r = 0\)), we have
\[
\begin{aligned}
\log \frac{p(r|D,Q)}{p(\bar{r}|D,Q)}
&= \log \frac{p(D|r,Q)\, p(r|Q)}{p(D|\bar{r},Q)\, p(\bar{r}|Q)} \\
&= \log \frac{p(D|r,Q)}{p(D|\bar{r},Q)} + \log \frac{p(r|Q)}{p(\bar{r}|Q)}
\end{aligned}
\]
The second term in the last line does not depend on the document \(D\); we can consider it an indicator of query difficulty. It follows that ranking by the log-odds of \(p(r|D,Q)\) is equivalent to ranking by
\[
\log \frac{p(D|r,Q)}{p(D|\bar{r},Q)}
\]
In the following sections we will talk about how to estimate this value.
The Binary Independence Model
In the Binary Independence Model, both documents and queries are represented by binary vectors of length \(|V|\), where \(V\) is the vocabulary.
More specifically,
\[
D = (D_1, D_2, \ldots, D_{|V|}), \qquad Q = (Q_1, Q_2, \ldots, Q_{|V|})
\]
where \(D_i = 1\) if term \(i\) appears in the document \(D\) and \(D_i = 0\) otherwise. The same applies to \(Q_i\).
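As a concrete illustration, here is a minimal Python sketch (the vocabulary and texts are made up) that builds such binary vectors:

```python
vocab = ["math", "theory", "cat", "proof"]

def binary_vector(text, vocab):
    """1 at position i if vocabulary term i appears in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocab]

doc = "a proof of a theory in math"
query = "math proof"
print(binary_vector(doc, vocab))    # [1, 1, 0, 1]
print(binary_vector(query, vocab))  # [1, 0, 0, 1]
```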
The Binary Independence Model makes two strong assumptions:
- Assumption T: Given relevance, terms are statistically independent.
- Assumption Q: The presence of a term in a document depends on relevance only when that term is present in the query.
Assumption T is a classic independence assumption that allows us to decompose a joint probability into per-term components. In mathematical terms, Assumption T can be expressed as
\[
p(D|Q,r) = \prod_{i=1}^{|V|} p(D_i|Q,r)
\]
Assumption Q basically says that for a term \(i\), if \(Q_i = 0\) then \(p(D_i|Q,r) = p(D_i|Q,\bar{r})\). Here is an example: suppose we want to search for documents about "Mathematics" and the search engine returns two sets of documents, (1) math-related documents and (2) non-math-related documents. Assumption Q implies that the probability of a document containing a term that does not appear in the query, such as "theory" or "definition", is the same for the two sets. This is obviously not true in the real world, because math-related documents are more likely to contain terms like "theory" or "definition".
Combining both assumptions, we have
\[
\log \frac{p(D|r,Q)}{p(D|\bar{r},Q)}
= \sum_{i=1}^{|V|} \log \frac{p(D_i|r,Q)}{p(D_i|\bar{r},Q)}
= \sum_{i:\, Q_i = 1} \log \frac{p(D_i|r,Q)}{p(D_i|\bar{r},Q)}
\]
where the second equality holds because, by Assumption Q, the factors for terms with \(Q_i = 0\) are identical in the numerator and denominator and cancel out.
All we care about is ranking, so we can transform the scoring formula a little bit as long as the ordering is preserved. Writing \(p_i = p(D_i = 1|r,Q)\) and \(u_i = p(D_i = 1|\bar{r},Q)\), the sum above equals
\[
\sum_{i:\, Q_i = 1} \left[ D_i \log \frac{p_i}{u_i} + (1 - D_i) \log \frac{1 - p_i}{1 - u_i} \right]
\]
The trick here is to subtract a value which does not depend on the document, namely \(\sum_{i:\, Q_i = 1} \log \frac{1 - p_i}{1 - u_i}\). After some rearrangement, we have
\[
\text{score}(D, Q) = \sum_{i:\, Q_i = 1,\ D_i = 1} \log \frac{p_i (1 - u_i)}{u_i (1 - p_i)} \tag{1}
\]
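Here is a minimal Python sketch of Eq (1); the probability estimates \(p_i\) and \(u_i\) are hypothetical, since nothing so far tells us how to obtain them:

```python
import math

def bim_score(doc_terms, query_terms, p, u):
    """Eq (1): sum of log(p_i (1 - u_i) / (u_i (1 - p_i))) over terms
    that appear in both the query and the document."""
    score = 0.0
    for t in query_terms & doc_terms:
        score += math.log(p[t] * (1 - u[t]) / (u[t] * (1 - p[t])))
    return score

# Hypothetical probability estimates for two query terms.
p = {"math": 0.8, "proof": 0.6}   # P(term present | relevant)
u = {"math": 0.2, "proof": 0.3}   # P(term present | irrelevant)
print(bim_score({"math", "proof", "cat"}, {"math", "proof"}, p, u))  # ~4.03
```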
The Robertson/Sparck Jones Weighting Formula
Let's define the following terms:
- \(N\): the number of documents in the collection
- \(N_t\): the number of documents in the collection that contain term \(t\)
- \(p_t = p(D_t = 1|r,Q)\) and \(u_t = p(D_t = 1|\bar{r},Q)\), as above
- \(w_t = \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)}\): the Robertson/Sparck Jones weight of term \(t\)

The scoring formula in Eq (1) can be expressed simply as
\[
\text{score}(D, Q) = \sum_{t \in Q \cap D} w_t
\]
Since relevant documents are typically a negligible fraction of the collection, a natural estimate is \(u_t \approx N_t / N\). If we additionally set \( p_t = \frac{1}{1 + (N - N_t)/N} \), then the formula becomes
\[
w_t = \log \frac{N}{N_t}
\]
which is the standard IDF formula.
Note that the weight \(w_t\) here is a weight of the term \(t\) in the collection; it doesn't depend on any specific document.
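A quick Python sketch of this weight, with a made-up collection size:

```python
import math

def idf(N, N_t):
    """w_t = log(N / N_t): weight of a term appearing in N_t of N documents."""
    return math.log(N / N_t)

N = 1_000_000
print(idf(N, N_t=10))       # rare term   -> ~11.51
print(idf(N, N_t=500_000))  # common term -> ~0.69
```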
Term Frequency and BM25
We can do better than the Binary Independence Model. Instead of using a binary representation of documents and queries, we could use term frequency vectors, which provide a more accurate picture of documents.
In 1994, Robertson and Walker suggested a simple approximation to the 2-Poisson term weight model that is based on a notion called "eliteness". A document is said to be elite in term \(t\) when it is somehow "about" the topic associated with the term.
The proposed formula is
\[
\sum_{t \in Q \cap D} \frac{f_t (k_1 + 1)}{k_1 + f_t} \, w_t
\]
where \(f_t\) is the frequency of term \(t\) in the document and \(k_1\) is a tuning parameter controlling how quickly the contribution of repeated occurrences saturates.
We can incorporate the query term frequencies as well, which gives
\[
\text{score}(D, Q) = \sum_{t \in Q \cap D} \frac{f_t (k_1 + 1)}{k_1 + f_t} \cdot \frac{q_t (k_3 + 1)}{k_3 + q_t} \cdot w_t \tag{2}
\]
where \(q_t\) is the frequency of term \(t\) in the query and \(k_3\) is another tuning parameter.
If \(q_t \to \infty\), then \( \frac{q_t (k_3 + 1)}{k_3 + q_t} \to k_3 + 1 \). So \(k_3 + 1\) is the maximum contribution factor of a term being repeated in the query. We could assume that if a term appears many times in the query, then it is important. It's common to set \(k_3\) to a very large value; if we let \(k_3 \to \infty\), then \( \frac{q_t (k_3 + 1)}{k_3 + q_t} \to q_t \) and the formula reduces to
\[
\text{score}(D, Q) = \sum_{t \in Q \cap D} \frac{f_t (k_1 + 1)}{k_1 + f_t} \, q_t \, w_t
\]
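A small numeric sketch of the query-side factor, with made-up values of \(q_t\) and \(k_3\), makes the saturation visible:

```python
def query_factor(q_t, k3):
    """The query-side factor q_t (k3 + 1) / (k3 + q_t)."""
    return q_t * (k3 + 1) / (k3 + q_t)

for q_t in (1, 2, 5, 100):
    print(q_t, round(query_factor(q_t, k3=8), 2))  # 1.0, 1.8, 3.46, 8.33
# The factor grows with q_t but never exceeds k3 + 1 = 9.
print(query_factor(5, k3=1e9))  # ~5.0: with a huge k3 the factor is just q_t
```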
BM25
What is missing in the previous models is any consideration of document length: a term occurring three times in a short document is stronger evidence of relevance than three occurrences in a very long one. If we incorporate document lengths, we get the BM25 model
\[
\text{score}(D, Q) = \sum_{t \in Q \cap D} \frac{f_t (k_1 + 1)}{k_1 \left(1 - b + b \frac{|D|}{\text{avgdl}}\right) + f_t} \, q_t \, w_t \tag{3}
\]
where \(|D|\) is the length of the document, \(\text{avgdl}\) is the average document length in the collection, and \(b \in [0, 1]\) controls the strength of the length normalization.
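Here is a minimal Python sketch of Eq (3); the function name and all the statistics are made up for illustration, and \(w_t\) is taken to be the IDF weight \(\log(N/N_t)\) from before:

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_len, N, df, k1=1.2, b=0.75):
    """Eq (3) for one document: saturated, length-normalized term
    frequencies combined with the IDF weight w_t = log(N / N_t)."""
    norm = 1 - b + b * doc_len / avg_len      # length normalization
    score = 0.0
    for t, q_t in query_tf.items():
        f_t = doc_tf.get(t, 0)
        if f_t == 0:
            continue
        w_t = math.log(N / df[t])             # IDF weight
        score += q_t * f_t * (k1 + 1) / (k1 * norm + f_t) * w_t
    return score

print(bm25_score(query_tf={"math": 1, "proof": 1},
                 doc_tf={"math": 3, "proof": 1},
                 doc_len=120, avg_len=100,
                 N=10_000, df={"math": 300, "proof": 50}))
```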
Field Weights and BM25F
There is still one element missing in Eq (3): we didn't take the document structure into account. For example, there are many tags in an HTML file, and some tags are more important than others. A term inside a <title> tag should carry more weight than the same term inside a <p> tag.
Here we present the BM25F model, which incorporates document structure and field weights.
It defines the following terms:
- \(S\): the set of fields (e.g. title, body) a document is divided into
- \(f_{t,s}\): the frequency of term \(t\) in field \(s\)
- \(|D_s|\) and \(\text{avgdl}_s\): the length of field \(s\) and its average over the collection
- \(v_s\): the weight of field \(s\)
- \(b_s\): the length-normalization parameter of field \(s\)
- \(\tilde{f}_t = \sum_{s \in S} v_s \cdot \frac{f_{t,s}}{1 - b_s + b_s \frac{|D_s|}{\text{avgdl}_s}}\): the field-weighted, length-normalized frequency of term \(t\)

As we can see here, \(v_s\) is used to adjust the relative importance of the different fields \(s\).
The scoring formula in BM25F is
\[
\text{score}(D, Q) = \sum_{t \in Q \cap D} \frac{\tilde{f}_t}{k_1 + \tilde{f}_t} \, w_t
\]
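And a minimal Python sketch of BM25F under the definitions above; the field names, weights, and statistics are all made up:

```python
import math

def bm25f_score(query_terms, field_tf, field_len, avg_field_len,
                v, b, N, df, k1=1.2):
    """BM25F for one document: per-field normalized term frequencies are
    combined with the field weights v[s] before a single saturation step."""
    score = 0.0
    for t in query_terms:
        f_tilde = 0.0
        for s in field_tf:
            norm = 1 - b[s] + b[s] * field_len[s] / avg_field_len[s]
            f_tilde += v[s] * field_tf[s].get(t, 0) / norm
        if f_tilde > 0:
            score += f_tilde / (k1 + f_tilde) * math.log(N / df[t])
    return score

doc = {"title": {"math": 1}, "body": {"math": 2, "proof": 1}}
print(bm25f_score({"math", "proof"}, doc,
                  field_len={"title": 8, "body": 200},
                  avg_field_len={"title": 10, "body": 180},
                  v={"title": 3.0, "body": 1.0},   # title weighted 3x
                  b={"title": 0.5, "body": 0.75},
                  N=10_000, df={"math": 300, "proof": 50}))
```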