### Notation

• $$t$$: Time step of the optimization process. It starts at zero and is incremented by 1 each time the parameters are updated.
• $$\theta_t$$: Parameter vector at time $$t$$. It has dimension $$d$$, and its components are referred to as features below.
• $$g_t$$: Gradient at time $$t$$.
• $$\eta$$: Learning rate.

Standard gradient descent is given by

$$\theta_{t+1} = \theta_{t} - \eta g_{t}$$
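As a minimal sketch of this update (the quadratic objective and all constants below are illustrative assumptions, not part of the text):

```python
import numpy as np

# Minimize f(theta) = ||theta - target||^2 with plain gradient descent.
# `target`, `eta`, and the iteration count are arbitrary illustrative choices.
target = np.array([3.0, -1.0])

def grad(theta):
    # Gradient of ||theta - target||^2 with respect to theta.
    return 2.0 * (theta - target)

eta = 0.1            # learning rate, shared by every component of theta
theta = np.zeros(2)  # theta_0

for t in range(100):
    theta = theta - eta * grad(theta)  # theta_{t+1} = theta_t - eta * g_t
```

Note that the single scalar `eta` multiplies every component of the gradient, which is exactly the shared-learning-rate behavior discussed next.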

A potential issue with this approach is that a single learning rate is shared by all features. This is appropriate when the features are similar, in the sense that they contribute to the objective function in a similar way. Of course, this is not always true.

How similarly the features contribute can be characterized by the gradients. Imagine we collect $$\{g_t\}_{t=1,\dots,T}$$ into a matrix

$$F = [g_1 \; g_2 \; ... \; g_T]$$

The matrix $$F$$ has $$d$$ rows and $$T$$ columns. We can think of each row as a time series measuring the sensitivity of the corresponding feature. The idea of the AdaGrad algorithm is that features with low sensitivity can use a relatively large learning rate, while features with high sensitivity should use a relatively small one. This idea can be expressed by the formula below

$$\eta_{i} = \frac{\eta}{\text{sensitivity of feature } i} \quad \text{for } i = 1, 2, \dots, d$$

The following approach is proposed in the paper:

$$\theta_{t+1} = \theta_{t} - \eta \, \operatorname{diag} \left( \sum\limits_{\tau = 1}^{t-1} g_{\tau} g_{\tau}^{T} \right)^{-1/2} g_{t}$$

which means the sensitivity measure of feature $$i$$ is taken to be $$\sqrt{\sum\limits_{j=1}^{t-1} F_{i,j}^2}$$, the root of the running sum of its squared gradient components. Note that this sensitivity is non-decreasing in $$t$$, so the per-feature learning rates shrink over time and vanish in the late stage of the optimization process.
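The diagonal of the accumulated outer products is just the per-feature running sum of squared gradients, so the update can be sketched without forming any matrix. The objective, constants, and the small `eps` term (a common practical safeguard against division by zero, not part of the formula above) are illustrative assumptions:

```python
import numpy as np

# Same illustrative quadratic objective: minimize ||theta - target||^2.
target = np.array([3.0, -1.0])

def grad(theta):
    return 2.0 * (theta - target)

eta = 1.0
eps = 1e-8           # guards against dividing by zero early on (an assumption)
theta = np.zeros(2)
accum = np.zeros(2)  # per-feature running sum of squared gradients,
                     # i.e. the diagonal of sum_tau g_tau g_tau^T

for t in range(500):
    g = grad(theta)
    accum += g ** 2
    # Each feature i is scaled by eta / sqrt(accum[i]): its own learning rate.
    theta = theta - eta * g / (np.sqrt(accum) + eps)
```

Because `accum` only grows, each feature's effective learning rate `eta / sqrt(accum[i])` decays over the run, matching the vanishing-learning-rate behavior noted above.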
