
To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters

Sara Dragutinović, Rajesh Ranganath

TL;DR

This paper investigates the potential downsides of the mechanism that drives Muon's training speedup. It analyzes the biases Muon induces during optimization, both theoretically and through their consequences for the learning trajectories and the solutions learned.

Abstract

For a long time, Adam has been the default choice for training deep neural networks. Recently, many new optimizers have been introduced, of which Muon has perhaps gained the most popularity due to its superior training speed. While many papers set out to validate the benefits of Muon, our paper investigates the potential downsides stemming from the mechanism driving this speedup. We explore the biases induced when optimizing with Muon, providing theoretical analysis and examining its consequences for the learning trajectories and the solutions learned. While the theory does justify the benefits Muon brings, it also guides our intuition in constructing a couple of examples where Muon-optimized models are at a disadvantage. The core problem we emphasize is that Muon optimization removes a simplicity bias that is naturally preserved by older, more thoroughly studied methods like Stochastic Gradient Descent (SGD). We take first steps toward understanding the consequences this may have: Muon might struggle to uncover common underlying structure across tasks, and may be more prone to fitting spurious features. More broadly, this paper should serve as a reminder: when developing new optimizers, it is essential to consider the biases they introduce, as these biases can fundamentally change a model's behavior -- for better or for worse.

Paper Structure (27 sections, 4 theorems, 13 equations, 6 figures, 1 algorithm)


Key Result

Theorem 3.1

Consider the gradient flow dynamics $\dot{U} = -\nabla_U L$ and $\dot{V} = -\nabla_V L$ on the loss $L(U, V)$ starting from infinitesimal initialization. Then:

Figures (6)

  • Figure 1: Illustration of the paper's theory section, for gradient flow (left) and Spectral GD (right). The top row shows the loss curve; the bottom row shows the evolution of the singular values of $VU$. The circled plots show each $u_i$ relative to the singular vectors $r_1$ and $r_2$ at each selected time step: the arrows represent $r_1$ and $r_2$, and each dot is one $u_i$ (here $d_{in}=d_{out}=2$ and $H=100$, so there are $H$ of them). For GD, the first singular vector is fully learned before the second one is learned. Spectral GD, in contrast, learns both at the same time; once it saturates on the smaller one, the $u_i$ progress only in the direction of the larger one. The simulation closely follows the theory, as expected.
  • Figure 2: a) The neural network used to solve the task; each gray arrow is a linear layer, with no nonlinearities in between. There are $M=7$ input and output domains. Each input domain has its own fixed set of 4 orthonormal vectors representing $\{1,2,3,4\}$. b) The underlying task being learned: mapping each number in $\{1,2,3,4\}$ to the output vector shown. Results after training in the 'routing' setup with c) SGD and d) Spectral GD, together with e) training loss curves. For every input-output pair of sources, we plot the function the model learned (the 4 column vectors are the images of $\{1,2,3,4\}$). Pairs seen during training are circled in green.
  • Figure 3: a) MNIST images containing a spurious pixel. b) Validation losses and c) accuracies for SGD, Muon and Adam, on the sets with (Sp) and without spurious features. d) Peak accuracy on the non-spurious dataset as a function of the intensity of the spurious pixel in the training data.
  • Figure 4: Additional figures supporting the paper's theory section. a) $d_\text{in} = d_\text{out}=2$; b) $d_\text{in} = d_\text{out}=3$
  • Figure 5: The oscillations of $\sigma_2(t)$ around $s_2$ in the setting of Figure 1, shown for different values of the learning rate. This phenomenon occurs after Spectral GD has learned a principal component effectively, but not exactly. The small residual noise in the direction of that principal component is then amplified by orthogonalization, and a step of order 1 is taken regardless of the noise magnitude.
  • ...and 1 more figure
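
The contrast described in the Figure 1 caption can be explored with a small NumPy simulation. This is a sketch under stated assumptions, not the authors' code: it assumes a squared loss $L(U,V)=\tfrac{1}{2}\|VU-A\|_F^2$ with a hypothetical target `A = diag(3, 1)`, a small Gaussian initialization, and it models Spectral GD as gradient descent in which each gradient is replaced by its polar factor (the orthogonalization that Muon approximates). The step size, initialization scale, and threshold are illustrative choices.

```python
import numpy as np

def polar_factor(G):
    """Orthogonalize G: keep its singular vectors, set all singular values to 1."""
    P, _, Qt = np.linalg.svd(G, full_matrices=False)
    return P @ Qt

def train(spectral, A, H=100, init_scale=1e-5, lr=1e-2, steps=3000, seed=0):
    """Train the two-layer linear map x -> V U x towards A.

    Returns the singular values of V @ U after every step (largest first).
    All hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    d_out, d_in = A.shape
    U = init_scale * rng.standard_normal((H, d_in))   # rows play the role of u_i
    V = init_scale * rng.standard_normal((d_out, H))
    history = np.empty((steps, min(d_in, d_out)))
    for t in range(steps):
        R = V @ U - A                    # residual of the assumed squared loss
        grad_U, grad_V = V.T @ R, R @ U.T
        if spectral:                     # assumed form of Spectral GD
            grad_U, grad_V = polar_factor(grad_U), polar_factor(grad_V)
        U -= lr * grad_U
        V -= lr * grad_V
        history[t] = np.linalg.svd(V @ U, compute_uv=False)
    return history

def first_step_above(values, threshold):
    """Index of the first step at which values reach the threshold."""
    return int(np.argmax(values >= threshold))

A = np.diag([3.0, 1.0])                  # hypothetical target, s_1 = 3, s_2 = 1
threshold = 0.5                          # half of the smaller singular value

gd_history = train(spectral=False, A=A)
sp_history = train(spectral=True, A=A)

# Number of steps between the first and second singular value crossing the threshold.
gd_gap = first_step_above(gd_history[:, 1], threshold) - first_step_above(gd_history[:, 0], threshold)
sp_gap = first_step_above(sp_history[:, 1], threshold) - first_step_above(sp_history[:, 0], threshold)

print("GD: steps between the two singular values crossing:", gd_gap)
print("Spectral GD: steps between the two singular values crossing:", sp_gap)
```

Under these assumptions, plain GD should show a long delay between the two crossings (the larger singular value is picked up first), while the orthogonalized variant should show the two singular values growing almost in lockstep.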

Theorems & Definitions (6)

  • Theorem 3.1: Gradient Flow Dynamics
  • Theorem 3.2: Spectral Gradient Flow Dynamics
  • Theorem
  • proof
  • Theorem
  • proof
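
The noise amplification mentioned in the Figure 5 caption can be checked numerically. The sketch below assumes that orthogonalization means replacing a gradient by its polar factor, computed here with an exact SVD; the matrix shape, learning rate, and noise scales are hypothetical. It compares the size of a plain gradient step with the size of an orthogonalized step as the residual gradient shrinks.

```python
import numpy as np

def orthogonalize(G):
    """Polar factor of G: same singular vectors, all singular values set to 1."""
    P, _, Qt = np.linalg.svd(G, full_matrices=False)
    return P @ Qt

rng = np.random.default_rng(0)
direction = rng.standard_normal((2, 2))   # fixed "noise" direction (illustrative)
lr = 0.01                                 # illustrative learning rate

raw_norms, orth_norms = [], []
for noise_scale in (1.0, 1e-3, 1e-6):
    residual_gradient = noise_scale * direction
    raw_step = lr * residual_gradient
    orth_step = lr * orthogonalize(residual_gradient)
    # ord=2 on a matrix gives the spectral norm (largest singular value).
    raw_norms.append(np.linalg.norm(raw_step, 2))
    orth_norms.append(np.linalg.norm(orth_step, 2))
    print(f"scale={noise_scale:g}  raw step={raw_norms[-1]:.3e}  orthogonalized step={orth_norms[-1]:.3e}")
```

The plain step shrinks with the gradient, whereas the orthogonalized step keeps spectral norm `lr` however small the residual gradient becomes, which is consistent with the oscillation of $\sigma_2(t)$ around $s_2$ scaling with the learning rate.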