Adaptive Matrix Online Learning through Smoothing with Guarantees for Nonsmooth Nonconvex Optimization

Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, Aryan Mokhtari

TL;DR

The paper tackles online linear optimization over matrices under an operator-norm constraint, where the best-known adaptive methods, such as Shampoo, must solve an expensive quadratic projection subproblem. It extends the Gradient-Based Prediction Algorithm (GBPA) to the matrix setting by introducing $(\alpha,\beta)$-admissible smoothings of the nuclear norm, and shows that any admissible smoothing achieves regret matching one-sided Shampoo up to constants. It then develops two efficient algorithms, FTPL (Gaussian smoothing) and FAML (hyperbolic smoothing), that avoid quadratic projections and admit closed-form updates while matching that regret bound. Through online-to-nonconvex conversion, the authors derive two optimizers, Pion (from FTPL) and Leon (from FAML), with convergence guarantees to $(\rho,\varepsilon)$-stationary points for nonsmooth nonconvex matrix optimization, a guarantee that the popular Muon optimizer lacks. Experiments on synthetic tasks illustrate stability and improved performance, suggesting practical relevance for spectral-structure-aware optimization in deep learning and for quasi-Newton-style online updates.

Abstract

We study online linear optimization with matrix variables constrained by the operator norm, a setting where the geometry renders designing data-dependent and efficient adaptive algorithms challenging. The best-known adaptive regret bounds are achieved by Shampoo-like methods, but they require solving a costly quadratic projection subproblem. To address this, we extend the gradient-based prediction scheme to adaptive matrix online learning and cast algorithm design as constructing a family of smoothed potentials for the nuclear norm. We define a notion of admissibility for such smoothings and prove any admissible smoothing yields a regret bound matching the best-known guarantees of one-sided Shampoo. We instantiate this framework with two efficient methods that avoid quadratic projections. The first is an adaptive Follow-the-Perturbed-Leader (FTPL) method using Gaussian stochastic smoothing. The second is Follow-the-Augmented-Matrix-Leader (FAML), which uses a deterministic hyperbolic smoothing in an augmented matrix space. By analyzing the admissibility of these smoothings, we show both methods admit closed-form updates and match one-sided Shampoo's regret up to a constant factor, while significantly reducing computational cost. Lastly, using the online-to-nonconvex conversion, we derive two matrix-based optimizers, Pion (from FTPL) and Leon (from FAML). We prove convergence guarantees for these methods in nonsmooth nonconvex settings, a guarantee that the popular Muon optimizer lacks.
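The two smoothings named in the abstract can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from the paper: the smoothing levels `sigma` and `mu` are fixed scalars rather than adaptive, the Gaussian expectation is estimated by Monte Carlo sampling rather than a closed form, and the hyperbolic smoothing shown is the standard potential $\operatorname{tr}(\mathbf{S}^\top\mathbf{S} + \mu^2 \mathbf{I})^{1/2}$ (the nuclear norm of the stacked matrix $[\mathbf{S}; \mu\mathbf{I}]$), which may differ from the paper's augmented construction. All function names are hypothetical.

```python
import numpy as np

def polar_factor(S):
    """Return U @ Vt from the thin SVD of S, a subgradient of the nuclear norm at S."""
    U, _, Vt = np.linalg.svd(S, full_matrices=False)
    return U @ Vt

def gaussian_smoothed_grad(S, sigma, n_samples=64, seed=0):
    """Monte Carlo estimate of the gradient of E_Z ||S + sigma * Z||_*,
    where Z has i.i.d. standard Gaussian entries (fixed scalar sigma is an assumption)."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(S.shape, dtype=float)
    for _ in range(n_samples):
        Z = rng.standard_normal(S.shape)
        acc += polar_factor(S + sigma * Z)
    return acc / n_samples

def hyperbolic_smoothed_grad(S, mu):
    """Gradient of tr (S^T S + mu^2 I)^{1/2}: each singular value s of S
    is mapped to s / sqrt(s^2 + mu^2), with singular vectors unchanged."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return (U * (s / np.sqrt(s**2 + mu**2))) @ Vt

# Illustrative use: play the negative smoothed gradient at the running gradient sum.
rng = np.random.default_rng(1)
G_sum = rng.standard_normal((5, 3))  # stand-in for an accumulated gradient matrix
X_ftpl = -gaussian_smoothed_grad(G_sum, sigma=0.5)
X_hyp = -hyperbolic_smoothed_grad(G_sum, mu=0.5)
```

Both iterates stay inside the operator-norm ball without any projection step: the first is an average of matrices with unit operator norm, and the second has singular values $s/\sqrt{s^2+\mu^2} < 1$. As $\mu \to 0$, the hyperbolic gradient approaches the polar factor $\mathbf{U}\mathbf{V}^\top$.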

Paper Structure

This paper contains 37 sections, 19 theorems, 149 equations, 1 figure, 2 tables, 3 algorithms.

Key Result

Lemma 3.1

Define $\mathcal{B}_{f}(\mathbf{U} \,\|\, \mathbf{V}) := f(\mathbf{U}) - f(\mathbf{V}) - \langle \nabla f(\mathbf{V}), \mathbf{U}-\mathbf{V} \rangle$ as the Bregman divergence with respect to a function $f$. If $\{\mathbf{X}_t\}$ is generated by the GBPA update (eq:GBPA), then ...
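As a small worked instance of the Bregman divergence defined in Lemma 3.1, the sketch below evaluates it numerically. It assumes $\langle \cdot, \cdot \rangle$ is the Frobenius (trace) inner product, and it uses $f(\mathbf{X}) = \tfrac{1}{2}\|\mathbf{X}\|_F^2$ purely as an illustrative potential, not one of the paper's smoothings. The function names are hypothetical.

```python
import numpy as np

def bregman_divergence(f, grad_f, U, V):
    """B_f(U || V) = f(U) - f(V) - <grad f(V), U - V>,
    using the Frobenius inner product (an assumption)."""
    inner = np.sum(grad_f(V) * (U - V))
    return f(U) - f(V) - inner

def half_sq_frobenius(X):
    """Illustrative potential f(X) = 0.5 * ||X||_F^2."""
    return 0.5 * np.sum(X**2)

def grad_half_sq_frobenius(X):
    """Gradient of 0.5 * ||X||_F^2 is X itself."""
    return X

rng = np.random.default_rng(0)
U = rng.standard_normal((4, 3))
V = rng.standard_normal((4, 3))
d = bregman_divergence(half_sq_frobenius, grad_half_sq_frobenius, U, V)
```

For this quadratic potential the divergence reduces to $\tfrac{1}{2}\|\mathbf{U}-\mathbf{V}\|_F^2$, which gives a quick sanity check on the implementation.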

Figures (1)

  • Figure 1: Convergence paths with different constant learning rates.

Theorems & Definitions (32)

  • Remark 2.1
  • Lemma 3.1
  • Definition 3.1
  • Theorem 3.2
  • Proposition 3.3
  • Theorem 4.1
  • Theorem 4.2
  • Definition 5.1: $(\rho, \varepsilon)$-stationary point
  • Proposition 5.1
  • Theorem 5.2: Convergence of Pion
  • ...and 22 more