ADAM: A Method for Stochastic Optimization
0.1 Summary
Adam is a first-order stochastic optimization algorithm that accumulates historical information about stochastic gradients and uses it to estimate the first and second moments of the stochastic gradient.
$$
\begin{align*}
x_{k+1} &= x_k - \alpha_k \dfrac{\widehat{m_k}}{\sqrt{\widehat{v_k}} + \epsilon} \tag{Adam}\\
\widehat{m_k} &= \dfrac{m_k}{1 - \beta_1^k}, \qquad m_k = \beta_1 m_{k-1} + (1 - \beta_1) g_k \tag{First moment estimation}\\
\widehat{v_k} &= \dfrac{v_k}{1 - \beta_2^k}, \qquad v_k = \beta_2 v_{k-1} + (1 - \beta_2) g_k^2 \tag{Second moment estimation}
\end{align*}
$$
All vector operations are element-wise. The default hyperparameter values are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$; $\epsilon$ is a small constant that avoids division by zero, and $g_k = \nabla f(x_k, \xi_k)$ is a sample of the stochastic gradient.
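For concreteness, here is a minimal NumPy sketch of a single Adam iteration following the update above; the function name `adam_step`, its signature, and the toy noisy-quadratic usage are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def adam_step(x, g, m, v, k, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters x given a stochastic gradient sample g.

    m and v are the running (biased) first and second moment estimates,
    k is the 1-based iteration counter; returns the updated (x, m, v).
    """
    m = beta1 * m + (1 - beta1) * g                  # m_k
    v = beta2 * v + (1 - beta2) * g**2               # v_k (element-wise square)
    m_hat = m / (1 - beta1**k)                       # bias-corrected first moment
    v_hat = v / (1 - beta2**k)                       # bias-corrected second moment
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)   # element-wise update
    return x, m, v

# Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2.
rng = np.random.default_rng(0)
x = np.array([5.0, -3.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
for k in range(1, 1001):
    g = x + 0.1 * rng.standard_normal(x.shape)       # stochastic gradient sample
    x, m, v = adam_step(x, g, m, v, k, alpha=0.1)    # step size enlarged for this toy problem
print(x)  # oscillates near the minimizer at the origin
```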
- We can view this approach as normalizing each parameter with an individual learning rate in $(0, 1)$: since $\mathbb{E}_{\xi_k}[\widehat{m_k}] = \mathbb{E}_{\xi_k}[g_k]$ and $\mathbb{E}_{\xi_k}[\widehat{v_k}] = \mathbb{E}_{\xi_k}[g_k \odot g_k]$, the ratio $\widehat{m_k} / (\sqrt{\widehat{v_k}} + \epsilon)$ is roughly bounded by one in magnitude, so each coordinate moves by at most about $\alpha_k$ (see the numerical sketch after this list).
- There are some concerns about Adam's effectiveness: several works have argued that adaptive gradient methods can lead to worse generalization.
- The name comes from “Adaptive Moment estimation”.
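The bias-correction claim in the first bullet can be checked numerically: with a constant gradient, the raw average $m_k$ is pulled toward its zero initialization, while $\widehat{m_k}$ recovers the gradient exactly. A small illustrative sketch (my own, not code from the paper):

```python
beta1 = 0.9
g = 1.0          # pretend every stochastic gradient sample equals 1
m = 0.0
for k in range(1, 6):
    m = beta1 * m + (1 - beta1) * g     # raw (biased) running average m_k
    m_hat = m / (1 - beta1**k)          # bias-corrected estimate
    print(k, round(m, 4), round(m_hat, 4))
# k = 1 gives m = 0.1 but m_hat = 1.0: the correction removes the bias
# introduced by initializing m at zero; the same argument applies to v_k.
```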
0.2 Bounds
| Conditions | $\Vert \mathbb{E}[f(x_k)] - f(x^*)\Vert \leq$ | Type of convergence | $\Vert \mathbb{E}[x_k] - x^* \Vert \leq$ |
|---|---|---|---|
| Convex | $\mathcal{O}\left(\dfrac{1}{\sqrt{k}} \right)$ | Sublinear | |
A version of Adam for strongly convex functions is considered in this work. The obtained rate is $\mathcal{O}\left(\dfrac{\log k}{\sqrt{k}} \right)$, while a version with a truly linear rate remains undiscovered.