### Introduction

In Probabilistic Model, we use \(p(D|Q,r)\) to rank documents. We further assume given relevance terms are statistically independent, which gives us the following formula:

The score of a document depends on the contribution of terms it contains. The contribution of a term is measured by "weight". If a term appears in the query it has more weight and if a term does not appear in the query then it has little weight. The score of a document is the sum of "weight" of term contained in that document. (Taking the sum is only one way to aggregate weights. To use *sum function*, we need the independence assumption.)

Language model is a different approach that avoids the independence assumption. For example, we should expect the occurrences of the term "teacher" and "school" are highly correlated in education related articles. The main ideas of language model are

- Treat query as a document.
- Use distribution of terms to represent to represent documents.

Once we know how to calculate the distribution of terms and use it to represent a document, we could easily calculate the distance between two documents. And once we can calculate the distance between any two documents, the ranking becomes straight forward: we rank documents based on \(dist(Q, D)\).

### Kullback-Leibler Divergence

KL divergence is one of classic ways to calculate distance between two probability distribution:

where \( M_k(t) \) is the density function of term \(t\) in document \(k\).

Note that expression

only depends on the query and it doesn't affect document ranking. Therefore, we only need to calculate

Next step is to define \(M_d(t)\).

### Language Models and Smoothing

In this section we present some of the language models.

The first one is the maximum likelihood model. This is the *empirical* term distribution of the document:

The maximum likelihood model does not take the collection of documents into account. Jelinek-Mercer smoothing model overcomes this drawback

We first define

where \(l_t\) is the number of times \(t\) occurs in the collection \(C\) and \(l_c\) is the total number of tokens in the collection. The Jelinek-Mercer smoothing model is just a linear combination of \(M_d^{ ml }(t)\) and \(M_c(t)\).

Another drawback in maximum likelihood model \(M_d^{ ml }(t)\) is that if a term does not appear in the document then it has zero probability. In general we do want every term to have positive probability and a standard way to achieve this is to apply Dirichlet smoothing, which gives us

----- END -----

©2019 - 2021 all rights reserved