SGD intuition in a scalar case

Gradient descent with an appropriately chosen constant learning rate converges to the minimum of a convex function:
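
To make this concrete, here is a minimal sketch (not from the original materials): gradient descent on the convex scalar function f(x) = x² with a constant learning rate. The starting point, step size, and iteration count are arbitrary illustrative choices.

```python
# Gradient descent on a convex scalar function with a constant learning rate.
# f(x) = x^2 has gradient 2x, i.e. a 2-Lipschitz gradient, so any constant
# learning rate below 2 / 2 = 1 converges to the global minimum x = 0.

def f(x):
    return x ** 2

def grad(x):
    return 2 * x

x, lr = 3.0, 0.1          # illustrative starting point and step size
for _ in range(50):
    x -= lr * grad(x)     # x_{k+1} = x_k - lr * f'(x_k)

print(x, f(x))            # x shrinks geometrically (factor 0.8 per step) towards 0
```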

But what if the function being minimized is not convex?

In contrast, Stochastic Gradient Descent (SGD) can escape local minima:
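
Below is a hedged toy sketch of this effect (my own construction, not the post's code): on a scalar double-well objective, plain GD started in the shallow well stays trapped there, while the same update with additive Gaussian noise standing in for minibatch gradient noise can hop over the barrier into the deeper well. The objective, noise scale, and other hyperparameters are illustrative.

```python
# Double-well objective: shallow local minimum near x ≈ +0.96,
# deeper global minimum near x ≈ -1.04, barrier between them near x ≈ 0.08.
import numpy as np

def f(x):
    return (x ** 2 - 1) ** 2 + 0.3 * x

def grad(x):
    return 4 * x * (x ** 2 - 1) + 0.3

rng = np.random.default_rng()
lr, sigma, steps = 0.05, 3.0, 2000        # illustrative hyperparameters

x_gd = x_sgd = 1.0                        # both start inside the shallow well
for _ in range(steps):
    x_gd  -= lr * grad(x_gd)                            # deterministic GD: stays trapped
    x_sgd -= lr * (grad(x_sgd) + sigma * rng.normal())  # noisy gradient mimics SGD

print(f"GD : x = {x_gd:+.3f}, f(x) = {f(x_gd):+.3f}")   # ≈ +0.96, the shallow minimum
print(f"SGD: x = {x_sgd:+.3f}, f(x) = {f(x_sgd):+.3f}") # with this noise level it usually
                                                        # wanders into the deeper well near -1.04;
                                                        # rerun a few times to see the randomness
```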

Recent studies suggest that we should care not only about the depth of a local minimum, but also about its width:

Idea of SAM (Sharpness-Aware Minimization)
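
As a back-of-the-envelope illustration (my own sketch, not from the post), compare a sharp and a wide quadratic minimum of the same depth: the plain loss cannot distinguish them, but the worst loss inside a small neighbourhood, which is the quantity SAM (Sharpness-Aware Minimization) minimizes, clearly prefers the wide one. The radius rho, the learning rate, and the sam_step helper below are illustrative assumptions.

```python
import numpy as np

rho = 0.1                                    # neighbourhood radius (illustrative)
sharp = lambda x: 50.0 * x ** 2              # sharp minimum at x = 0
wide  = lambda x: 0.5  * x ** 2              # wide minimum at x = 0, same depth

# The plain loss at the two minima is identical (both 0), but the
# neighbourhood-worst loss that SAM targets is very different:
print(max(sharp(x) for x in (-rho, rho)))    # 0.5   -> sharp minimum is penalized
print(max(wide(x)  for x in (-rho, rho)))    # 0.005 -> wide minimum is preferred

def sam_step(x, grad, lr=0.1):
    """One SAM update: ascend to the worst nearby point, then descend from there."""
    g = grad(x)
    eps = rho * np.sign(g) if g != 0 else 0.0   # scalar analogue of rho * g / ||g||
    return x - lr * grad(x + eps)               # gradient taken at the perturbed point
```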

One more interesting case in which the classical convergence of Gradient Descent may not be what we actually want:

What initially looks like clear divergence in fact leads to a better minimum from the generalization perspective:
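
A hedged scalar sketch of this phenomenon (my own construction, with illustrative curvatures, learning rate, and starting point): GD on a quadratic well with curvature a is stable only if lr < 2/a, so with lr = 0.1 the iterates oscillate with growing amplitude inside a sharp well (curvature 50), which looks exactly like divergence, until they spill over into a wide well (curvature 1) and quietly converge there, the toy analogue of ending up in a flatter, better-generalizing minimum.

```python
def grad(x):
    # Gradient of a toy landscape: sharp quadratic well with curvature 50
    # centered at x = 1 (for x >= 0), wide quadratic well with curvature 1
    # centered at x = -1 (for x < 0). Only the gradient is needed for GD.
    return 50.0 * (x - 1.0) if x >= 0 else 1.0 * (x + 1.0)

lr = 0.1                      # 2/50 = 0.04 < lr < 2/1 = 2.0
x = 1.01                      # start very close to the sharp minimum
trajectory = [x]
for _ in range(100):
    x -= lr * grad(x)
    trajectory.append(x)

print(trajectory[:6])         # oscillation around x = 1 grows: apparent divergence
print(trajectory[-1])         # ... yet the final iterate sits near the wide minimum x = -1
```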

Code