Steepest descent in three norms: SGD vs SignGD vs Muon
1 Setup
Ill-conditioned matrix quadratic with condition number \kappa = 20:
f(W) = \frac{1}{2}\|\Sigma W\|_F^2, \qquad W \in \mathbb{R}^{8 \times 8}.
Three optimizers correspond to steepest descent in three different norms:
| Optimizer | Norm | Update direction |
|---|---|---|
| SGD | Frobenius \|\cdot\|_F | -G / \|G\|_F |
| SignGD | Elementwise \ell_\infty | -\text{sign}(G) |
| Muon | Spectral \|\cdot\|_\sigma | -UV^\top (polar factor) |
Learning rates are individually tuned via grid search for each method.
2 Animation
Left panel: convergence of f(W_k) / f(W_0) in log scale. All three methods converge, but at different rates depending on how well the norm matches the problem geometry.
Right panel: normalized singular value spectrum of the update direction \Delta W at each iteration. SGD concentrates almost all energy on the top 1–2 singular directions (the dominant eigenvectors of \Sigma^2). Muon distributes energy uniformly across all directions — every singular value of UV^\top equals 1 by construction.
3 Key observation
The spectral structure of the update reveals why the methods differ:
- SGD is biased toward the directions where the gradient is largest (high-curvature directions). Low-curvature directions receive negligible updates.
- SignGD normalizes per-element, partially compensating for scale differences, but does not account for correlations between coordinates.
- Muon normalizes the entire matrix spectrally — all singular directions get equal weight. This is equivalent to steepest descent in the spectral norm, or equivalently, the LMO over the unit ball of the nuclear norm.