Adaptive Matrix Online Learning through Smoothing with Guarantees for Nonsmooth Nonconvex Optimization
Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, Aryan Mokhtari
TL;DR
The paper tackles online matrix optimization under an operator-norm constraint, where existing adaptive methods such as Shampoo require solving an expensive quadratic projection subproblem. It extends the Gradient-Based Prediction Algorithm (GBPA) to the matrix setting by introducing $(\alpha,\beta)$-admissible smoothings of the nuclear norm, and shows that any such smoothing achieves regret matching one-sided Shampoo up to constants. It then develops two efficient algorithms, FTPL (Gaussian smoothing) and FAML (hyperbolic smoothing), that avoid quadratic projections while retaining these guarantees. Through the online-to-nonconvex conversion, the authors derive two optimizers, Pion (from FTPL) and Leon (from FAML), with convergence guarantees to $(\rho,\varepsilon)$-stationary points for nonsmooth nonconvex matrix optimization, a guarantee that Muon lacks. Empirical results on synthetic tasks illustrate stability and improved performance, suggesting practical relevance for spectral-structure-aware optimization in deep learning and for quasi-Newton-style online updates.
Abstract
We study online linear optimization with matrix variables constrained by the operator norm, a setting where the geometry renders designing data-dependent and efficient adaptive algorithms challenging. The best-known adaptive regret bounds are achieved by Shampoo-like methods, but they require solving a costly quadratic projection subproblem. To address this, we extend the gradient-based prediction scheme to adaptive matrix online learning and cast algorithm design as constructing a family of smoothed potentials for the nuclear norm. We define a notion of admissibility for such smoothings and prove any admissible smoothing yields a regret bound matching the best-known guarantees of one-sided Shampoo. We instantiate this framework with two efficient methods that avoid quadratic projections. The first is an adaptive Follow-the-Perturbed-Leader (FTPL) method using Gaussian stochastic smoothing. The second is Follow-the-Augmented-Matrix-Leader (FAML), which uses a deterministic hyperbolic smoothing in an augmented matrix space. By analyzing the admissibility of these smoothings, we show both methods admit closed-form updates and match one-sided Shampoo's regret up to a constant factor, while significantly reducing computational cost. Lastly, using the online-to-nonconvex conversion, we derive two matrix-based optimizers, Pion (from FTPL) and Leon (from FAML). We prove convergence guarantees for these methods in nonsmooth nonconvex settings, a guarantee that the popular Muon optimizer lacks.
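To make the role of the nuclear norm concrete: over the unit operator-norm ball, the linear objective $\langle G, X\rangle$ is minimized at $X = -UV^\top$, where $G = U\Sigma V^\top$ is a thin SVD, and the optimal value is $-\|G\|_*$. Gaussian smoothing replaces $\|G\|_*$ by $\mathbb{E}\,\|G + \sigma Z\|_*$, whose gradient can be estimated by averaging $UV^\top$ over perturbed copies of $G$. The following is a minimal, non-adaptive sketch of one such perturbed-leader step, not the paper's algorithm: the function names `polar_factor` and `ftpl_step`, the fixed scalar `sigma`, and the Monte Carlo sample count `num_samples` are assumptions introduced here for illustration, and the paper's adaptive scaling is not reproduced.

```python
import numpy as np


def polar_factor(G):
    """Return U @ Vt from the thin SVD of G (a subgradient of the nuclear norm at G)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def ftpl_step(grad_sum, sigma, num_samples, rng):
    """Monte Carlo estimate of -E[polar_factor(grad_sum + sigma * Z)], Z standard Gaussian.

    grad_sum    : running sum of observed gradient matrices (float array)
    sigma       : hypothetical fixed perturbation scale (the paper's is adaptive)
    num_samples : number of Gaussian draws used to approximate the expectation
    rng         : a numpy.random.Generator
    """
    acc = np.zeros_like(grad_sum)
    for _ in range(num_samples):
        Z = rng.standard_normal(grad_sum.shape)
        acc += polar_factor(grad_sum + sigma * Z)
    # An average of matrices with operator norm at most 1 stays in the unit ball.
    return -acc / num_samples
```

Each sample costs one SVD and no projection is needed, since feasibility follows from convexity of the operator-norm ball. With `sigma = 0` the step reduces to the unperturbed leader $-UV^\top$.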
