Summary

Adam is a stochastic first-order optimization algorithm that accumulates historical information about stochastic gradients and uses it to estimate their first and second moments.

$$
\begin{aligned}
m_k &= \beta_1 m_{k-1} + (1 - \beta_1) g_k, \\
v_k &= \beta_2 v_{k-1} + (1 - \beta_2) g_k^2, \\
\hat{m}_k &= \dfrac{m_k}{1 - \beta_1^k}, \qquad \hat{v}_k = \dfrac{v_k}{1 - \beta_2^k}, \\
x_{k+1} &= x_k - \dfrac{\alpha \, \hat{m}_k}{\sqrt{\hat{v}_k} + \varepsilon}.
\end{aligned}
$$

All vector operations are element-wise. $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$ are the default values for the hyperparameters ($\varepsilon$ here is needed to avoid division by zero), and $g_k$ is a sample of the stochastic gradient.
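As a concrete illustration, here is a minimal NumPy sketch of the update above. The function name `adam_step`, the toy quadratic objective, and the step size `alpha=0.05` are illustrative assumptions, not taken from the text:

```python
import numpy as np

def adam_step(x, m, v, k, grad, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam iteration (k starts at 1); all operations are element-wise."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate m_k
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate v_k
    m_hat = m / (1 - beta1 ** k)                 # bias correction
    v_hat = v / (1 - beta2 ** k)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Toy usage: minimize ||x||^2 from noisy gradient samples (illustrative only).
rng = np.random.default_rng(0)
x = np.array([5.0, -3.0])
m, v = np.zeros_like(x), np.zeros_like(x)
for k in range(1, 2001):
    g = 2 * x + 0.1 * rng.standard_normal(x.shape)   # stochastic gradient sample g_k
    x, m, v = adam_step(x, m, v, k, g, alpha=0.05)
print(x)  # should end up close to the minimizer [0, 0]
```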

  • We can consider this approach as a normalization of each parameter, applying the individual learning rate $\dfrac{\alpha}{\sqrt{\hat{v}_k} + \varepsilon}$ to each coordinate, since $\hat{m}_k$ estimates $\mathbb{E}\left[g_k\right]$ and $\hat{v}_k$ estimates $\mathbb{E}\left[g_k^2\right]$ (see the sketch after this list).
  • There are some concerns about Adam's effectiveness: several works have stated that adaptive gradient methods can lead to worse generalization.
  • The name comes from “Adaptive Moment Estimation”.
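A small sketch of the per-coordinate normalization mentioned in the first bullet; the constant toy gradients and the 100-iteration loop are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Two coordinates whose gradient magnitudes differ by four orders of magnitude
# receive very different effective step sizes alpha / (sqrt(v_hat) + eps).
alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m, v = np.zeros(2), np.zeros(2)
g = np.array([100.0, 0.01])                    # constant toy gradient samples
for k in range(1, 101):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = v / (1 - beta2 ** k)
print(alpha / (np.sqrt(v_hat) + eps))          # ~[1e-05, 1e-01]: the large-gradient
                                               # coordinate gets the smaller step
```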

Bounds

| Conditions | Type of convergence |
|------------|---------------------|
| Convex     | Sublinear           |

A version of Adam for strongly convex functions is considered in this work. The obtained rate is $ \mathcal{O}\left(\dfrac{\log k}{\sqrt{k}} \right) $, while a version achieving a truly linear rate remains undiscovered.
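For context, and using standard definitions rather than anything from the cited work, a truly linear rate would mean a geometric decrease of the suboptimality gap, in contrast to the sublinear rate above:

$$
\underbrace{f(x_k) - f^* = \mathcal{O}\!\left(\dfrac{\log k}{\sqrt{k}}\right)}_{\text{sublinear}}
\qquad \text{vs.} \qquad
\underbrace{f(x_k) - f^* = \mathcal{O}\!\left(c^{k}\right), \quad 0 < c < 1}_{\text{linear}}
$$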