SGD intuition in a scalar case
Gradient descent with an appropriately chosen constant learning rate converges to the minimum of a convex function:
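As a rough numerical sketch of this behaviour (the function, starting point, and step size below are my own illustrative choices, not taken from the figure):

```python
# A minimal sketch (assumed setup): gradient descent on the convex scalar
# function f(x) = x^2 with a constant learning rate.
def f(x):
    return x ** 2

def grad_f(x):
    return 2.0 * x

x = 5.0      # arbitrary starting point
lr = 0.1     # constant learning rate, small enough relative to the curvature
for _ in range(100):
    x -= lr * grad_f(x)

print(x, f(x))  # both approach 0: the iterate converges to the unique minimum
```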
But what if the function being minimized is not convex?
In contrast, Stochastic Gradient Descent (SGD) can escape local minima:
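A hedged sketch of this effect (the toy function, noise level, and step size are my own choices): the same scalar update, but with Gaussian noise added to the gradient as a stand-in for mini-batch noise. Started next to the shallow minimum, plain gradient descent stays there, while the noisy iterate can hop over the barrier into the deeper basin.

```python
# A minimal sketch (assumed setup): noisy "SGD-style" updates on a non-convex
# scalar function with a shallow minimum near x ~ 1.1 and a deeper one near
# x ~ -1.3. Gaussian noise on the gradient mimics mini-batch stochasticity.
import numpy as np

def f(x):
    return x ** 4 - 3 * x ** 2 + x

def grad_f(x):
    return 4 * x ** 3 - 6 * x + 1

rng = np.random.default_rng(0)
lr, steps = 0.02, 3000

x_gd = x_sgd = 1.5                      # start next to the shallow minimum
for _ in range(steps):
    x_gd  -= lr * grad_f(x_gd)                          # deterministic GD
    x_sgd -= lr * (grad_f(x_sgd) + 10 * rng.normal())   # noisy gradient

# GD stays trapped near the shallow minimum; the noisy run can cross the
# barrier and is typically found near the deeper minimum at x ~ -1.3.
print(f"GD:  x = {x_gd:.3f}, f = {f(x_gd):.3f}")
print(f"SGD: x = {x_sgd:.3f}, f = {f(x_sgd):.3f}")
```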
Recent studies suggest that we should care not only about the depth of a local minimum, but also about its width:
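A crude way to see why width matters (an illustrative toy of my own, not a result from those studies): treat the test loss as a slightly shifted copy of the training loss. The same shift barely changes the loss at a wide minimum, but changes it a lot at a sharp minimum of equal depth.

```python
# A minimal sketch (assumed setup): two minima of equal depth but different
# width. A small horizontal shift of the landscape, a crude model of the
# train/test mismatch, hurts the sharp minimum far more than the wide one.
def sharp(x):
    return 50.0 * x ** 2     # narrow basin: high curvature

def wide(x):
    return 0.5 * x ** 2      # wide basin: low curvature

shift = 0.3                  # hypothetical train/test shift of the landscape
print("sharp minimum after shift:", sharp(0.0 + shift))   # 4.5
print("wide  minimum after shift:", wide(0.0 + shift))    # 0.045
```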
Here is one more interesting case, in which the classical convergence of Gradient Descent may not be what we actually want:
Meanwhile, what initially looks like clear divergence ends up at a better minimum from the generalization perspective:
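One way to make this concrete (a toy construction of my own, not the experiment behind the original figure): a landscape with a sharp basin and a wide basin. With a constant learning rate that is too large for the sharp basin, the iterates oscillate around the sharp minimum with growing amplitude, which looks like divergence, until they are thrown into the wide basin, where the same learning rate converges.

```python
# A minimal sketch (assumed toy landscape): the pointwise minimum of a sharp
# parabola centred at x = -2 and a wide parabola centred at x = 2. A learning
# rate above 2 / curvature makes GD "diverge" from the sharp basin, yet the
# same rate converges inside the wide one.
SHARP_C, SHARP_CURV = -2.0, 100.0
WIDE_C,  WIDE_CURV  =  2.0,   2.0

def branch(x):
    sharp = 0.5 * SHARP_CURV * (x - SHARP_C) ** 2
    wide  = 0.5 * WIDE_CURV  * (x - WIDE_C) ** 2
    return (SHARP_C, SHARP_CURV) if sharp < wide else (WIDE_C, WIDE_CURV)

def grad_f(x):
    c, curv = branch(x)
    return curv * (x - c)

x, lr = -2.01, 0.03          # start inside the sharp basin; lr * 100 > 2
for step in range(100):
    x -= lr * grad_f(x)
    if step < 8 or step == 99:
        print(step, round(x, 3))   # growing oscillation, then x drifts to ~2
```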