Related Reading
- Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (Duchi, Hazan & Singer, 2011)
- Lecture 4: Optimization Landscape of Deep Learning
- Lecture 13: Data Modelling and Distributions
Notation
- \(t\): Time step of the optimization process. It is set to zero at the start of the optimization and increased by 1 each time we update the parameters.
- \(\theta_t\): Parameters at time \(t\). It is a vector of dimension \(d\) and its components are usually called features.
- \(g_t\): Gradient at time \(t\).
- \(\eta\): Learning rate.
AdaGrad Algorithm
The standard gradient descent update is given by
\[
\theta_{t+1} = \theta_t - \eta\, g_t .
\]
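As a quick illustration, here is a minimal NumPy sketch of this update on a made-up least-squares problem; the matrix \(A\), vector \(b\), and the learning rate value are arbitrary choices for the example, not anything specified in the text.

```python
import numpy as np

# Toy objective: f(theta) = 0.5 * ||A @ theta - b||^2,
# with gradient A^T (A @ theta - b). A and b are placeholders.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

def grad(theta):
    return A.T @ (A @ theta - b)

eta = 0.01           # learning rate, shared by all d features
theta = np.zeros(5)  # theta_0

for t in range(100):
    g = grad(theta)          # g_t
    theta = theta - eta * g  # theta_{t+1} = theta_t - eta * g_t
```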
A potential issue with this approach is that the learning rate \(\eta\) is shared by all features of the parameter vector. This is appropriate if all features are similar, in the sense that they contribute to the objective function in a similar way, but of course this is not always true.
The similarity of the features' contributions can be characterized by their gradients. Imagine we put \(\{g_t\}_{t=1,\dots,T}\) into a matrix
\[
F = \begin{pmatrix} g_1 & g_2 & \cdots & g_T \end{pmatrix},
\qquad F_{i,j} = g_{j,i},
\]
where \(g_{j,i}\) denotes the \(i\)-th component of the gradient at time \(j\).
The matrix \(F\) has \(d\) rows and \(T\) columns, and we can think of each row as a time series measuring the sensitivity of the corresponding feature. The idea of the AdaGrad algorithm is that features with low sensitivity can use a relatively large learning rate, while features with high sensitivity should use a relatively small one. This idea can be expressed by the formula below
\[
\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{s_{t,i}}\, g_{t,i},
\]
where \(s_{t,i}\) is a measure of the sensitivity of feature \(i\) at time \(t\).
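To make the per-feature update concrete, here is a small NumPy sketch; the sensitivity values in `s` are made-up placeholders, since the text has not yet specified how they are computed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
eta = 0.1

# Hypothetical per-feature sensitivities s_{t,i}: feature 0 is very
# sensitive, features 3 and 4 barely move the objective.
s = np.array([10.0, 2.0, 1.0, 0.2, 0.1])

theta = np.zeros(d)
g = rng.normal(size=d)  # stand-in for the gradient g_t

# Per-feature update: theta_{t+1,i} = theta_{t,i} - (eta / s_{t,i}) * g_{t,i}
theta = theta - (eta / s) * g

print(eta / s)  # effective learning rate per feature: small where s is large
```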
The following choice is proposed in the paper:
\[
s_{t,i} = \sqrt{\sum_{j=1}^{t} F_{i,j}^2} = \sqrt{\sum_{j=1}^{t} g_{j,i}^2},
\]
i.e. the sensitivity measure of feature \(i\) is the root of that feature's accumulated squared gradients up to time \(t\). Note that the sensitivity is a non-decreasing function of \(t\), which means the per-feature learning rates shrink, and eventually vanish, in the late stages of the optimization process.
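Putting the pieces together, here is a short NumPy sketch of the resulting AdaGrad update on the same kind of toy least-squares problem as above; the small `eps` added to the denominator (to avoid division by zero before any gradient has accumulated) is an extra safeguard, not part of the formulas above.

```python
import numpy as np

# Toy least-squares objective, used only to have a gradient to feed AdaGrad.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

def grad(theta):
    return A.T @ (A @ theta - b)

eta = 0.5
eps = 1e-8              # small constant to avoid dividing by zero
theta = np.zeros(5)
sum_sq = np.zeros(5)    # running sum of squared gradients, one entry per feature

for t in range(200):
    g = grad(theta)
    sum_sq += g**2                         # accumulate F_{i,j}^2 over j
    s = np.sqrt(sum_sq)                    # sensitivity s_{t,i}
    theta = theta - (eta / (s + eps)) * g  # per-feature AdaGrad step
```

Because `sum_sq` only grows, the effective step size `eta / (s + eps)` for each feature shrinks over time, exactly as described above.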