# Summary

Suppose our target function is a sum (average) of functions:

$$
f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x) \to \min_{x \in \mathbb{R}^p}
$$

This problem usually arises in deep learning, where the gradient of the loss function is computed over a huge number of data points, which makes each iteration expensive: evaluating the full gradient is linear in $n$.

Thus, we can switch from the full gradient calculation to its unbiased estimator:

$$
x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k),
$$

where the index $i_k$ is chosen uniformly at random at each iteration: $\mathbb{P}(i_k = i) = \frac{1}{n}$.

Iterations could be $n$ times cheaper! But convergence requires $\alpha_k \to 0$.
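As a sanity check, the unbiasedness claim can be verified numerically on a least-squares finite sum $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$, where a single per-sample gradient costs $\mathcal{O}(p)$ instead of $\mathcal{O}(np)$. The problem instance below is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
x = rng.standard_normal(p)

def grad_i(x, i):
    # gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2, costs O(p)
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):
    # gradient of f(x) = (1/n) * sum_i f_i(x), costs O(n p)
    return A.T @ (A @ x - b) / n

# Unbiasedness: the uniform average of per-sample gradients
# equals the full gradient, E[grad f_{i_k}(x)] = grad f(x)
avg = np.mean([grad_i(x, i) for i in range(n)], axis=0)
assert np.allclose(avg, full_grad(x))
```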

# Convergence

## General setup

We consider the classic finite-sample average minimization problem:

$$
\min_{x \in \mathbb{R}^p} f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)
$$

Let us consider stochastic gradient descent, assuming $\nabla f$ is $L$-Lipschitz:

$$
x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k) \tag{SGD}
$$

Lipschitz continuity of the gradient implies:

$$
f(x_{k+1}) \leq f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2.
$$

Using $(\text{SGD})$:

$$
f(x_{k+1}) \leq f(x_k) - \alpha_k \langle \nabla f(x_k), \nabla f_{i_k}(x_k) \rangle + \alpha_k^2 \frac{L}{2}\|\nabla f_{i_k}(x_k)\|^2.
$$

Now let’s take expectation with respect to $i_k$:

$$
\mathbb{E}[f(x_{k+1})] \leq \mathbb{E}\left[f(x_k) - \alpha_k \langle \nabla f(x_k), \nabla f_{i_k}(x_k) \rangle + \alpha_k^2 \frac{L}{2}\|\nabla f_{i_k}(x_k)\|^2\right].
$$

Using linearity of expectation:

$$
\mathbb{E}[f(x_{k+1})] \leq f(x_k) - \alpha_k \langle \nabla f(x_k), \mathbb{E}[\nabla f_{i_k}(x_k)] \rangle + \alpha_k^2 \frac{L}{2}\mathbb{E}\left[\|\nabla f_{i_k}(x_k)\|^2\right].
$$

Since uniform sampling gives an unbiased estimate of the gradient, $\mathbb{E}[\nabla f_{i_k}(x_k)] = \nabla f(x_k)$, we obtain:

$$
\mathbb{E}[f(x_{k+1})] \leq f(x_k) - \alpha_k \|\nabla f(x_k)\|^2 + \alpha_k^2 \frac{L}{2}\mathbb{E}\left[\|\nabla f_{i_k}(x_k)\|^2\right].
$$

## Polyak-Lojasiewicz condition

The Polyak-Lojasiewicz ($\text{PL}$) condition requires that, for some $\mu > 0$,

$$
\frac{1}{2}\|\nabla f(x)\|^2 \geq \mu \left(f(x) - f^*\right) \quad \forall x.
$$

This inequality simply requires that the gradient grows faster than a quadratic function as we move away from the optimal function value. Note that strong convexity implies $\text{PL}$, but not vice versa. Using $\text{PL}$ we can write:

$$
\mathbb{E}[f(x_{k+1})] - f^* \leq (1 - 2\alpha_k \mu)\left(f(x_k) - f^*\right) + \alpha_k^2 \frac{L}{2}\mathbb{E}\left[\|\nabla f_{i_k}(x_k)\|^2\right].
$$

This bound already indicates that we get something like linear convergence while we are far from the solution and the stochastic gradients are well aligned, but no progress once we are close to the solution or the variance of the stochastic gradients is high.
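As an illustration (not part of the derivation above), the $\text{PL}$ inequality can be checked numerically for a strongly convex quadratic $f(x) = \frac{1}{2}x^\top A x$ with $A \succ 0$, for which it holds with $\mu = \lambda_{\min}(A)$; the matrix below is a random synthetic example:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
M = rng.standard_normal((p, p))
A = M @ M.T + np.eye(p)            # symmetric positive definite
mu = np.linalg.eigvalsh(A).min()   # PL constant for this quadratic

f = lambda x: 0.5 * x @ A @ x      # f* = 0 is attained at x* = 0
grad = lambda x: A @ x

for _ in range(100):
    x = rng.standard_normal(p)
    # PL inequality: 0.5 * ||grad f(x)||^2 >= mu * (f(x) - f*)
    assert 0.5 * grad(x) @ grad(x) >= mu * f(x) - 1e-9
```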

## Stochastic subgradient descent

Here we apply the same update with a subgradient in place of the gradient:

$$
x_{k+1} = x_k - \alpha_k g_{i_k}
$$

for some $g_{i_k} \in \partial f_{i_k}(x_k)$.

For convex $f$ with minimizer $x^*$, expanding the squared distance to the solution and using $\mathbb{E}[\langle g_{i_k}, x_k - x^* \rangle] \geq f(x_k) - f(x^*)$, we have

$$
\mathbb{E}\left[\|x_{k+1} - x^*\|^2\right] \leq \|x_k - x^*\|^2 - 2\alpha_k \left(f(x_k) - f(x^*)\right) + \alpha_k^2 \mathbb{E}\left[\|g_{i_k}\|^2\right].
$$

Here we can see that the step size $\alpha_k$ controls how fast we move towards the solution, while the squared step size $\alpha_k^2$ controls how much the variance pushes us away. Usually we bound $\mathbb{E}\left[\|g_{i_k}\|^2\right]$ by some constant $B^2$.

If we also have $\mu$-strong convexity:

$$
\mathbb{E}\left[\|x_{k+1} - x^*\|^2\right] \leq (1 - 2\alpha_k \mu)\|x_k - x^*\|^2 + \alpha_k^2 B^2.
$$

And finally, with a constant step size $\alpha_k = \alpha < \frac{1}{2\mu}$, unrolling this recursion gives

$$
\mathbb{E}\left[\|x_k - x^*\|^2\right] \leq (1 - 2\alpha\mu)^k R^2 + \frac{\alpha B^2}{2\mu},
$$

where $R = \Vert x_0 - x^* \Vert$.
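The behavior this bound predicts — geometric decay of the $R^2$ term plus an $\mathcal{O}(\alpha)$ noise floor around the solution — can be observed empirically with constant step sizes. The noisy least-squares instance and the two step-size values below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 5
A = rng.standard_normal((n, p))
b = A @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)  # noisy targets
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimizer of the average loss

def run_sgd(alpha, iters=50000, tail=5000):
    """Constant step-size SGD; returns the mean squared distance
    to x* averaged over the last `tail` iterates (the noise floor)."""
    x = np.zeros(p)
    dists = []
    for k in range(iters):
        i = rng.integers(n)                        # uniform random index
        x = x - alpha * A[i] * (A[i] @ x - b[i])   # stochastic gradient step
        if k >= iters - tail:
            dists.append(np.sum((x - x_star) ** 2))
    return np.mean(dists)

floor_large = run_sgd(alpha=0.01)
floor_small = run_sgd(alpha=0.0005)
# A smaller constant step gives a smaller noise floor, at the price
# of a slower transient phase.
print(floor_large, floor_small)
```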

# Bounds

| Conditions | $\mathbb{E}[f(x_k)] - f(x^*) \leq$ | Type of convergence |
|---|---|---|
| Convex, Lipschitz-continuous gradient ($L$) | $\mathcal{O}\left(\dfrac{1}{\sqrt{k}}\right)$ | Sublinear |
| $\mu$-Strongly convex, Lipschitz-continuous gradient ($L$) | $\mathcal{O}\left(\dfrac{1}{k}\right)$ | Sublinear |
| Convex, non-smooth | $\mathcal{O}\left(\dfrac{1}{\sqrt{k}}\right)$ | Sublinear |
| $\mu$-Strongly convex, non-smooth | $\mathcal{O}\left(\dfrac{1}{k}\right)$ | Sublinear |

# Code
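Below is a minimal, self-contained sketch of $(\text{SGD})$ with a decreasing step size $\alpha_k \propto 1/\sqrt{k}$ on a synthetic least-squares problem; all data and hyperparameters are illustrative assumptions, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 10
A = rng.standard_normal((n, p))
b = A @ rng.standard_normal(p) + 0.01 * rng.standard_normal(n)

def f(x):
    # f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
    return 0.5 * np.mean((A @ x - b) ** 2)

def sgd(x0, alpha0=0.05, iters=20000):
    x = x0.copy()
    for k in range(iters):
        i = rng.integers(n)                  # uniform random index i_k
        g = A[i] * (A[i] @ x - b[i])         # unbiased gradient estimate
        x -= alpha0 / np.sqrt(k + 1) * g     # decreasing step size alpha_k
    return x

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # reference solution
x_sgd = sgd(np.zeros(p))
print(f(x_sgd) - f(x_star))                      # small suboptimality gap
```

With a constant step size instead of the decreasing schedule, the iterates would only reach a neighborhood of the solution, as derived in the convergence section above.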

# References

- Lecture by Mark Schmidt @ University of British Columbia
- Convergence theorems on major cases of GD, SGD (projected version included)