# ADAM: A Method for Stochastic Optimization

### 0.1 Summary

Adam is a stochastic first-order optimization algorithm that accumulates historical information about stochastic gradients and uses it to estimate the second moment of the stochastic gradients.

$$
\begin{align*}
x_{k+1} &= x_k - \alpha_k \dfrac{\widehat{m}_k}{\sqrt{\widehat{v}_k} + \epsilon} \tag{ADAM}\\
\widehat{m}_k &= \dfrac{m_k}{1 - \beta_1^k} \tag{First moment estimation}\\
m_k &= \beta_1 m_{k-1} + (1 - \beta_1) g_k \\
\widehat{v}_k &= \dfrac{v_k}{1 - \beta_2^k} \tag{Second moment estimation}\\
v_k &= \beta_2 v_{k-1} + (1 - \beta_2) g_k^2
\end{align*}
$$

All vector operations are element-wise. The default hyperparameter values are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ ($\epsilon$ is needed to avoid division by zero), and $g_k = \nabla f(x_k, \xi_k)$ is a sample of the stochastic gradient.
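The update above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: `adam_step` is a hypothetical helper name, and the quadratic objective $f(x) = x^2$ with deterministic gradient $2x$ stands in for a stochastic gradient sample.

```python
import numpy as np

def adam_step(x, g, m, v, k, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters x, given a (stochastic) gradient
    sample g at step k (1-indexed). All operations are element-wise."""
    m = beta1 * m + (1 - beta1) * g       # first-moment running average
    v = beta2 * v + (1 - beta2) * g**2    # second-moment running average
    m_hat = m / (1 - beta1**k)            # bias correction for m
    v_hat = v / (1 - beta2**k)            # bias correction for v
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Toy usage: minimize f(x) = x^2 (gradient 2x), starting from x = 1.
x = np.array([1.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
for k in range(1, 2001):
    g = 2 * x
    x, m, v = adam_step(x, g, m, v, k)
```

Note that with the default $\alpha = 0.001$ the effective per-coordinate step is roughly $\alpha \cdot \mathrm{sign}(g)$ early on, since $\widehat{m}_k / \sqrt{\widehat{v}_k}$ has magnitude close to one; this is the adaptive normalization discussed below.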

- We can view this approach as a per-parameter normalization with individual learning rates, since $\mathbb{E}_{\xi_k}[g_k] = \mathbb{E}_{\xi_k}[\widehat{m}_k]$ and $\mathbb{E}_{\xi_k}[g_k \odot g_k] = \mathbb{E}_{\xi_k}[\widehat{v}_k]$.
- There are some concerns about Adam's effectiveness: several works have reported that adaptive-gradient methods can lead to worse generalization.
- The name comes from “**Ada**ptive **M**oment estimation”.

### 0.2 Bounds

| Conditions | $\Vert \mathbb{E}[f(x_k)] - f(x^*)\Vert \leq$ | Type of convergence | $\Vert \mathbb{E}[x_k] - x^* \Vert \leq$ |
|---|---|---|---|
| Convex | $\mathcal{O}\left(\dfrac{1}{\sqrt{k}}\right)$ | Sublinear | |

A version of Adam for strongly convex functions is considered in this work. The obtained rate is $\mathcal{O}\left(\dfrac{\log k}{\sqrt{k}}\right)$, while a version achieving a truly linear rate remains undiscovered.