A lot of practical task nowadays are being solved by the deep learning approach, which is usually implies finding local minimum of a non - convex function, that generalizes well (enough 😉). The goal of this short text is to provide you an importance of the optimization behind neural network training.
One of the most commonly used loss functions in classification tasks is the normalized categorical cross entropy in class problem:
Since in Deep Learning tasks the number of points in a dataset could be really huge, we usually use Stochastic gradient descent based approaches as a workhorse.
In such algorithms one uses the estimation of a gradient at each step instead of the full gradient vector, for example, in cross entropy we have:
The simplest approximation is statistically judged unbiased estimation of a gradient:
where we initially sample randomly only points and calculate sample average. It can be also considered as a noisy version of the full gradient approach.