AdaGrad Algorithm


The standard gradient descent update is given by

$$ \theta_{t+1} = \theta_{t} - \eta g_{t} $$
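This update can be sketched in a few lines of numpy. The objective \(f(\theta) = \frac{1}{2}\lVert\theta\rVert^2\) below is a hypothetical example chosen only because its gradient is \(\theta\) itself.

```python
import numpy as np

def gd_step(theta, grad, eta=0.1):
    """One step of plain gradient descent: theta_{t+1} = theta_t - eta * g_t."""
    return theta - eta * grad

# Minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = gd_step(theta, theta)
print(theta)  # close to [0, 0]
```

Note that the single scalar `eta` scales every coordinate of the gradient equally, which is exactly the property questioned next.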

A potential issue with this approach is that a single learning rate is shared by all features in the parameters. This is appropriate if all features are similar, in the sense that they contribute to the objective function in comparable ways. Of course, this is not always true.

How similarly features contribute can be characterized by their gradients. Imagine we collect \(\{g_t\}_{t=1,\dots,T}\) into a matrix

$$ F = [g_1 \; g_2 \; ... \; g_T] $$

The matrix \(F\) has \(d\) rows and \(T\) columns. We can think of each row as a time series measuring the sensitivity of the corresponding feature. The idea of the AdaGrad algorithm is that features with low sensitivity can use a relatively large learning rate, while features with high sensitivity use a relatively small one. This idea can be expressed by the formula below

$$ \eta_{i} = \frac{\eta}{\text{sensitivity of feature}\; i} \quad \text{for} \; i = 1, 2, \dots, d $$
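To make the per-feature rates concrete, here is a minimal numpy sketch. The matrix `F` and the base rate `eta` are hypothetical values, and the row-wise root-sum-of-squares is used as the sensitivity measure, anticipating the choice made below.

```python
import numpy as np

# Hypothetical gradient history: F has d = 2 rows (features) and T = 3 columns (steps).
F = np.array([[10.0, 9.0, 11.0],   # high-sensitivity feature: large gradients
              [0.1,  0.2,  0.1]])  # low-sensitivity feature: small gradients

eta = 0.5
# Per-row accumulated gradient magnitude, sqrt(sum_j F[i, j]^2).
sensitivity = np.sqrt((F ** 2).sum(axis=1))
per_feature_eta = eta / sensitivity
print(per_feature_eta)  # the low-sensitivity row gets a much larger rate
```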

The following approach is proposed in the paper:

$$ \theta_{t+1} = \theta_{t} - \eta \, \operatorname{diag} \left( \sum\limits_{\tau = 1}^{t-1} g_{\tau} g_{\tau}^{T} \right)^{-1/2} g_{t} $$

which means the sensitivity measure of feature \(i\) is set to \( \sqrt {\sum\limits_{j=1}^{t-1} F_{i,j}^2 }\). Note that the sensitivity is nondecreasing in \(t\), which means the effective learning rates shrink over time and can vanish in the late stages of the optimization process.
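Since the matrix inside the update is diagonal, the update reduces to dividing each coordinate of \(g_t\) by the root of that coordinate's accumulated squared gradients. A minimal sketch along those lines, on a hypothetical quadratic objective (the small `eps` term is a common numerical-stability addition, not part of the formula above):

```python
import numpy as np

def adagrad(grad_fn, theta, eta=1.0, steps=100, eps=1e-8):
    """AdaGrad: scale each coordinate by the root of its accumulated
    squared gradients (the diagonal of sum_tau g_tau g_tau^T)."""
    accum = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        accum += g ** 2                        # running per-feature sensitivity
        theta = theta - eta * g / (np.sqrt(accum) + eps)
    return theta

# Quadratic objective with very different curvature per coordinate:
# the gradient of each coordinate differs in scale by a factor of 100.
grad = lambda th: np.array([100.0, 1.0]) * th
theta = adagrad(grad, np.array([1.0, 1.0]))
print(theta)
```

Despite the 100x difference in gradient scale between the two coordinates, both converge under the same base rate `eta`, because each coordinate is normalized by its own history.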
