The Ky Fan Norms and Beyond: Dual Norms and Combinations for Matrix Optimization

Alexey Kravatskiy, Ivan Kozyrev, Nikolai Kozlov, Alexander Vinogradov, Daniil Merkulov, Ivan Oseledets

TL;DR

This work investigates Muon-like optimization of functions of weight matrices by moving beyond the spectral norm to duals of Ky Fan norms, introducing the Fanion family of algorithms and their F- and S-hybrids with Normalized SGD and SignSGD. It shows that Neon can be less effective, while intermediate-rank Fanions (e.g., Fanion-k) interpolate between rank-1 and full-rank updates and remain competitive with Muon on real tasks. Updates are computed via thick-restart Lanczos (TRLan), which yields low-rank approximations and allows the method to scale to large neural networks. Empirically, F-Muon and S-Muon closely match Muon on CIFAR-10 airbench and large-scale language-model benchmarks, with F-Muon offering improved learning-rate robustness. The results suggest substantial flexibility in the choice of norm for LMO-based optimizers and point to promising directions for the theory and practical deployment of non-Euclidean LMOs.

Abstract

In this article, we explore the use of various matrix norms for optimizing functions of weight matrices, a crucial problem in training large language models. Moving beyond the spectral norm underlying the Muon update, we leverage duals of the Ky Fan $k$-norms to introduce a family of Muon-like algorithms we name Fanions, which are closely related to Dion. By working with duals of convex combinations of the Ky Fan $k$-norms with either the Frobenius norm or the $l_\infty$ norm, we construct the families of F-Fanions and S-Fanions, respectively. Their most prominent members are F-Muon and S-Muon. We complement our theoretical analysis with an extensive empirical study of these algorithms across a wide range of tasks and settings, demonstrating that F-Muon and S-Muon consistently match Muon's performance, while outperforming vanilla Muon on a synthetic linear least squares problem.
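The interpolation described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it uses a full SVD rather than TRLan, and the function names (`ky_fan_norm`, `rank_k_direction`, `frobenius_blend`) and the mixing weight `alpha` are hypothetical choices made for illustration. It relies on the standard fact that, over the unit ball of the dual of the Ky Fan $k$-norm, the inner product with a gradient $G$ is maximized by the rank-$k$ matrix built from the top $k$ singular vector pairs of $G$; an optimizer would step along the negative of that direction.

```python
import numpy as np

def ky_fan_norm(G, k):
    """Ky Fan k-norm: sum of the k largest singular values of G."""
    s = np.linalg.svd(G, compute_uv=False)  # singular values, descending
    return float(s[:k].sum())

def rank_k_direction(G, k):
    """Rank-k direction sum_{i<=k} u_i v_i^T from the top-k singular pairs.

    Assumption: this is the maximizer of <G, X> over the unit ball of the
    dual of the Ky Fan k-norm. k = 1 gives a rank-1 update, k = min(G.shape)
    gives the full-rank factor U V^T.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :k] @ Vt[:k, :]

def frobenius_blend(G, alpha):
    """Hypothetical F-hybrid: mix the full-rank direction with G / ||G||_F.

    Assumption: the dual of alpha * nuclear + (1 - alpha) * Frobenius yields
    this convex combination of the two directions.
    """
    full_rank = rank_k_direction(G, min(G.shape))
    normalized = G / np.linalg.norm(G)  # Frobenius-normalized gradient
    return alpha * full_rank + (1.0 - alpha) * normalized

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))  # stand-in for a gradient or momentum matrix
D = rank_k_direction(G, 2)
B = frobenius_blend(G, 0.5)
```

In this sketch the inner product of `D` with `G` equals `ky_fan_norm(G, 2)`, which is a quick way to check that the direction attains the norm it is derived from.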

Paper Structure

This paper contains 42 sections, 5 theorems, 37 equations, 13 figures, 5 tables.

Key Result

lemma 1

When $\lVert \cdot\rVert = \|\cdot\|_{*}$, eq:our_update becomes

Figures (13)

  • Figure 1: Linear least squares problem for a 500x500 matrix.
  • Figure 2: Mean validation accuracies for F-Muon with different $\alpha$.
  • Figure 3: Visualization of the LMO balls for Muon and F-Muon for CNN training.
  • Figure 4: The validation loss for NanoGPT.
  • Figure 5: The validation loss for GPT-2 Medium.
  • ...and 8 more figures

Theorems & Definitions (9)

  • lemma 1
  • proof
  • lemma 2
  • proof
  • lemma 3
  • lemma 4
  • proof
  • corollary 1
  • proof