Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods

Andrey Veprikov, Arman Bolatov, Samuel Horváth, Aleksandr Beznosikov, Martin Takáč, Slavomir Hanzely

TL;DR

This work presents a unified optimization framework based on preconditioned matrix norms that subsumes steepest-descent, quasi-Newton, and adaptive methods through two norm families, the $(L,R)$-norm and the $D$-norm. It formalizes how linear minimization oracles operate in transformed gradient spaces and derives both necessary and sufficient invariance conditions for affine and scale transformations in matrix-parameter settings, enabling principled algorithm design. The authors introduce MuAdam and MuAdam-SANIA, blending spectral geometry with Adam-style preconditioning, and demonstrate competitive performance across scale-invariance tests, GLUE fine-tuning, large language model fine-tuning, and character-level modeling. By connecting diverse optimization approaches under a single theoretical umbrella, the paper highlights a rich design space for robust, geometry-aware optimizers with practical impact on real-world deep learning tasks.

Abstract

Optimization lies at the core of modern deep learning, yet existing methods often face a fundamental trade-off between adapting to problem geometry and exploiting curvature information. Steepest descent algorithms adapt to different geometries through the choice of norm but remain strictly first-order, whereas quasi-Newton and adaptive optimizers incorporate curvature information but are restricted to Frobenius geometry, limiting their applicability across diverse architectures. In this work, we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus all emerge as special cases of the same principle. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting, establishing necessary and sufficient conditions under generalized norms. Building on this foundation, we introduce two new methods, $\texttt{MuAdam}$ and $\texttt{MuAdam-SANIA}$, which combine the spectral geometry of Muon with Adam-style preconditioning. Our experiments demonstrate that these optimizers are competitive with, and in some cases outperform, existing state-of-the-art methods. Our code is available at https://github.com/brain-lab-research/LIB/tree/quasi_descent

Paper Structure

This paper contains 28 sections, 4 theorems, 46 equations, 2 figures, 8 tables, 1 algorithm.

Key Result

theorem 1

The linear minimization oracles for the $(L,R)$-norm and the $D$-norm can be expressed as […] For an invertible matrix $M$, we use the shorthand $M^{-T}$ for its inverse transpose.
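To make the idea of a linear minimization oracle (LMO) in a transformed gradient space concrete, here is a minimal NumPy sketch. It rests on an assumption that is not confirmed by the text above: that the $(L,R)$-norm takes the form $\|X\|_{L,R} = \|L X R\|$ for some base norm and invertible $L$, $R$. Under that assumption, the change of variables $Y = LXR$ gives $\mathrm{lmo}_{L,R}(G) = L^{-1}\,\mathrm{lmo}(L^{-T} G R^{-T})\,R^{-1}$. The function names `lmo_spectral` and `lmo_lr` are hypothetical, and the spectral norm is chosen as the base norm purely for illustration; the paper's own definitions and statement may differ.

```python
import numpy as np

def lmo_spectral(G):
    """argmin of <G, X> over the spectral-norm unit ball: -U V^T (thin SVD of G)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -U @ Vt

def lmo_lr(G, L, R, base_lmo=lmo_spectral):
    """LMO for the ASSUMED norm ||X||_{L,R} = ||L X R||_base, with L and R invertible."""
    L_inv = np.linalg.inv(L)
    R_inv = np.linalg.inv(R)
    # Move the gradient into the transformed space: G_tilde = L^{-T} G R^{-T}.
    G_tilde = L_inv.T @ G @ R_inv.T
    # Solve the base LMO there, then map the solution back to the original space.
    return L_inv @ base_lmo(G_tilde) @ R_inv

# Small numerical check with well-conditioned random preconditioners.
rng = np.random.default_rng(0)
m, n = 4, 3
G = rng.standard_normal((m, n))
L = 3.0 * np.eye(m) + 0.3 * rng.standard_normal((m, m))
R = 3.0 * np.eye(n) + 0.3 * rng.standard_normal((n, n))

X = lmo_lr(G, L, R)

# The solution lies on the boundary of the unit ball of the assumed (L,R)-norm.
constraint = np.linalg.norm(L @ X @ R, 2)
# The attained value <G, X> equals minus the dual (nuclear) norm of G_tilde.
value = np.sum(G * X)
dual = np.linalg.norm(np.linalg.inv(L).T @ G @ np.linalg.inv(R).T, "nuc")
```

With $L = R = I$ the sketch reduces to the plain spectral LMO, the orthogonalized-gradient direction commonly associated with Muon-style updates. Explicit inverses are used here only for readability; a practical implementation would more likely use linear solves or structured (for example diagonal) preconditioners.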

Figures (2)

  • Figure 1: Scale invariance experiment (Mushrooms, LIBSVM) with a two-layer MLP. Training loss (left, log-scale) and test accuracy (right) on original vs. scaled inputs.
  • Figure 2: LLM fine-tuning results on Qwen2-7B: mean final accuracy with standard deviation across three seeds.

Theorems & Definitions (10)

  • definition 1
  • definition 2
  • theorem 1
  • theorem 2
  • corollary 1
  • proof
  • proof
  • theorem 3
  • proof
  • proof: Proof of the affine invariance theorem