Information Retrieval - Language Model

Subscribe Send me a message home page


In Probabilistic Model, we use \(p(D|Q,r)\) to rank documents. We further assume given relevance terms are statistically independent, which gives us the following formula:

$$ p(D|Q,r) = \prod_{j=1}^{|V|} p(D_j|Q,r) $$

The score of a document depends on the contribution of terms it contains. The contribution of a term is measured by "weight". If a term appears in the query it has more weight and if a term does not appear in the query then it has little weight. The score of a document is the sum of "weight" of term contained in that document. (Taking the sum is only one way to aggregate weights. To use sum function, we need the independence assumption.)

Language model is a different approach that avoids the independence assumption. For example, we should expect the occurrences of the term "teacher" and "school" are highly correlated in education related articles. The main ideas of language model are

Once we know how to calculate the distribution of terms and use it to represent a document, we could easily calculate the distance between two documents. And once we can calculate the distance between any two documents, the ranking becomes straight forward: we rank documents based on \(dist(Q, D)\).

Kullback-Leibler Divergence

KL divergence is one of classic ways to calculate distance between two probability distribution:

$$ dist(Q, D) = \sum_{ t \in V } { \mathcal{M}_{q}(t) \cdot log \frac{\mathcal{M}_{q}(t)}{\mathcal{M}_{d}(t)} } = \sum_{ t \in V } { \mathcal{M}_{q}(t) \cdot log \mathcal{M}_{q}(t) } - \sum_{ t \in V } { \mathcal{M}_{q}(t) \cdot log \mathcal{M}_{d}(t) } $$

where \( M_k(t) \) is the density function of term \(t\) in document \(k\).

Note that expression

$$ \sum_{ t \in V } { \mathcal{M}_{q}(t) \cdot log \mathcal{M}_{q}(t) } $$

only depends on the query and it doesn't affect document ranking. Therefore, we only need to calculate

$$ \sum_{ t \in V } { \mathcal{M}_{q}(t) \cdot log \mathcal{M}_{d}(t) } $$

Next step is to define \(M_d(t)\).

Language Models and Smoothing

In this section we present some of the language models.

The first one is the maximum likelihood model. This is the empirical term distribution of the document:

$$ M_d^{ ml }(t) = \frac{ f_{t,d} } { l_{d} } $$

The maximum likelihood model does not take the collection of documents into account. Jelinek-Mercer smoothing model overcomes this drawback

We first define

$$ M_c(t) = l_t/l_c $$

where \(l_t\) is the number of times \(t\) occurs in the collection \(C\) and \(l_c\) is the total number of tokens in the collection. The Jelinek-Mercer smoothing model is just a linear combination of \(M_d^{ ml }(t)\) and \(M_c(t)\).

$$ M_d^{ \lambda }(t) = (1 - \lambda) \cdot M_d^{ ml }(t) + \lambda \cdot M_c^{ }(t) $$

Another drawback in maximum likelihood model \(M_d^{ ml }(t)\) is that if a term does not appear in the document then it has zero probability. In general we do want every term to have positive probability and a standard way to achieve this is to apply Dirichlet smoothing, which gives us

$$ M_d^{ \mu }(t) = \frac{ f_{t,d} + \mu M_c(t) } { l_d + \mu } $$

----- END -----

Send me a message Subscribe to blog updates

Want some fun stuff?