The Ky Fan Norms and Beyond: Dual Norms and Combinations for Matrix Optimization
Alexey Kravatskiy, Ivan Kozyrev, Nikolai Kozlov, Alexander Vinogradov, Daniil Merkulov, Ivan Oseledets
TL;DR
This work investigates Muon-like optimization of functions of weight matrices by moving beyond the spectral norm to duals of Ky Fan $k$-norms, introducing the Fanion family of algorithms and their F- and S- hybrids with Normalized SGD and SignSGD. It shows that Neon can be less effective, while intermediate-rank Fanions (e.g., Fanion-$k$) interpolate between rank-1 and full-rank updates and remain competitive with Muon on real tasks. Updates are computed efficiently via thick-restart Lanczos (TRLan), which yields the required low-rank approximations and enables scalable application to large neural networks. Empirically, F-Muon and S-Muon closely match Muon on CIFAR-10 airbench and on large-scale language-model benchmarks, with F-Muon offering improved learning-rate robustness. The results suggest substantial flexibility in the choice of norm for LMO-based optimizers and point to promising directions for the theory and practical deployment of non-Euclidean LMOs.
Abstract
In this article, we explore the use of various matrix norms for optimizing functions of weight matrices, a crucial problem in training large language models. Moving beyond the spectral norm underlying the Muon update, we leverage duals of the Ky Fan $k$-norms to introduce a family of Muon-like algorithms we name Fanions, which are closely related to Dion. By working with duals of convex combinations of the Ky Fan $k$-norms with either the Frobenius norm or the $\ell_\infty$ norm, we construct the families of F-Fanions and S-Fanions, respectively. Their most prominent members are F-Muon and S-Muon. We complement our theoretical analysis with an extensive empirical study of these algorithms across a wide range of tasks and settings, demonstrating that F-Muon and S-Muon consistently match Muon's performance, while outperforming vanilla Muon on a synthetic linear least squares problem.
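To make the rank interpolation concrete, the following is a minimal NumPy sketch of a rank-$k$ update direction. It assumes the standard result that the linear minimization oracle over the unit ball of the dual of the Ky Fan $k$-norm returns, up to sign, $U_k V_k^\top$ built from the top-$k$ singular vectors of the gradient. The function name `fanion_direction` is hypothetical, and a full SVD stands in for the thick-restart Lanczos routine mentioned in the TL;DR; this is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def fanion_direction(grad, k):
    """Return U_k V_k^T from the top-k singular triplets of grad.

    Hypothetical helper: full SVD is used for clarity only; a large-scale
    version would compute just the top-k triplets (e.g. with Lanczos).
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    k = min(k, s.size)  # cap k at the number of singular values
    return u[:, :k] @ vt[:k, :]

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))          # stand-in for a gradient matrix
s = np.linalg.svd(G, compute_uv=False)   # singular values, descending

for k in (1, 2, 4):
    D = fanion_direction(G, k)
    # The inner product <G, D> equals the Ky Fan k-norm of G,
    # i.e. the sum of its k largest singular values.
    print(k, np.linalg.matrix_rank(D), np.isclose(np.sum(G * D), s[:k].sum()))
```

With $k = 1$ the direction is rank-1, and with $k$ equal to the full rank it reduces to the orthogonalized gradient $U V^\top$ that underlies the Muon update; intermediate $k$ sit between the two.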
