Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, Jason D. Lee

Abstract

Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon and SGD on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and moreover Muon saturates at a larger critical batch size. We further analyze the multi-step dynamics under a thresholded gradient approximation and show that Muon achieves a substantially faster initial recovery rate than SGD, while both methods eventually converge to the information-theoretic limit at comparable speeds. Experiments on synthetic tasks validate the predicted scaling laws. Our analysis provides a quantitative understanding of the signal amplification of Muon and lays the groundwork for establishing scaling laws across more practical language modeling tasks and optimizers.
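To make the setting concrete, the following is a minimal numerical sketch (our own illustration, not the authors' code) of the linear associative memory task and of a single Muon versus SGD step on the cross-entropy loss. The problem sizes, the sampling scheme, and the prefix-based capacity measure are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the authors' code) of the linear
# associative memory task: Gaussian input/output embeddings, power-law item
# frequencies p_i ∝ i^{-alpha}, and one step of Muon (orthogonalized gradient)
# versus plain SGD on the cross-entropy loss. All constants are placeholders.
import numpy as np

rng = np.random.default_rng(0)
N, d, alpha, B = 2000, 64, 1.5, 20000

E = rng.standard_normal((N, d)) / np.sqrt(d)       # input embeddings e_i
U = rng.standard_normal((N, d)) / np.sqrt(d)       # output embeddings u_i
p = np.arange(1, N + 1, dtype=float) ** (-alpha)
p /= p.sum()                                       # power-law frequencies

def ce_gradient(W, idx):
    """Minibatch cross-entropy gradient in W for logits[i, j] = u_j^T W e_i."""
    X = E[idx]                                     # (B, d) sampled inputs
    logits = X @ W.T @ U.T                         # (B, N)
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    P[np.arange(len(idx)), idx] -= 1.0             # softmax minus one-hot labels
    return (P @ U).T @ X / len(idx)                # (d, d)

def capacity(W):
    """Largest k such that items 1..k are all recalled via argmax_j u_j^T W e_i."""
    correct = np.argmax(E @ W.T @ U.T, axis=1) == np.arange(N)
    return int(np.argmin(correct)) if not correct.all() else N

idx = rng.choice(N, size=B, p=p)                   # frequency-weighted minibatch
G = ce_gradient(np.zeros((d, d)), idx)             # gradient at initialization W = 0

# Muon replaces the gradient by its orthogonalization U V^T, where G = U S V^T.
# The argmax recall metric is invariant to the overall scale of W, so for a single
# step from W = 0 we can compare the two update directions without a step size.
Uo, _, Vt = np.linalg.svd(G, full_matrices=False)
W_muon = -(Uo @ Vt)
W_sgd = -G

print("capacity after one Muon step:", capacity(W_muon))
print("capacity after one SGD step :", capacity(W_sgd))
```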

Paper Structure

This paper contains 52 sections, 35 theorems, 343 equations, and 10 figures.

Key Result

Theorem 1

Let $d$ be the embedding dimension and $B$ be the batch size, and suppose the $i$th item has power law frequency $p_i \propto i^{-\alpha}$ for $\alpha > 1$. One step of Muon on the associative memory task recovers the top $\widetilde{\Theta}(\min\{d^{1 + \frac{1}{2\alpha}}, B^{\frac{1}{\alpha}}\})$ most frequent associations.
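Ignoring the logarithmic factors and constants hidden in the $\widetilde{\Theta}$, the stated rate can be read off directly: the one-step Muon capacity grows as $B^{1/\alpha}$ until the batch size reaches the point where $B^{1/\alpha} = d^{1+\frac{1}{2\alpha}}$, i.e. $B = d^{\alpha + \frac{1}{2}}$, after which it saturates at the $d^{1+\frac{1}{2\alpha}}$ ceiling. The helper below (our own illustration, not code from the paper) evaluates this prediction for given $d$, $B$, $\alpha$.

```python
# Illustrative helpers that evaluate the one-step Muon rate from Theorem 1,
# ignoring the log factors and constants hidden in the Theta-tilde notation.
def muon_one_step_capacity(d: float, B: float, alpha: float) -> float:
    """min(d^{1 + 1/(2 alpha)}, B^{1/alpha}): batch-limited, then dimension-limited."""
    return min(d ** (1 + 1 / (2 * alpha)), B ** (1 / alpha))

def critical_batch_size(d: float, alpha: float) -> float:
    """Batch size where B^{1/alpha} meets the d^{1 + 1/(2 alpha)} ceiling: B = d^{alpha + 1/2}."""
    return d ** (alpha + 0.5)

for d in (256, 1024, 4096):
    print(d, muon_one_step_capacity(d, B=1e6, alpha=1.5), critical_batch_size(d, alpha=1.5))
```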

Figures (10)

  • Figure 1: (a) Capacity achieved by one Muon and GD step on the population objective; Muon improves the storage capacity when the item frequencies are power-law distributed with exponent $\alpha>1$. (b) Critical batch size for the first Muon and SGD step ($\alpha=1.5$); the Muon capacity saturates at a much larger batch size than SGD.
  • Figure 2: Capacity scaling after one population Muon and GD step. We set $N=100{,}000$ and vary $d,\alpha$. Each experiment is repeated 16 times. For each $\alpha$, we fit the dimension exponent of the mean capacity $d^{C_\alpha}$ (dashed lines), and then fit the exponents $C_\alpha$ to the form $C_\alpha = c_1 + \frac{c_2}{\alpha}$ (solid lines); a sketch of this two-stage fit is given after this list. Observe that Muon achieves much higher storage than GD, and the exponents are consistent with Theorems \ref{thm:main} and \ref{thm:gd}.
  • Figure 3: Capacity scaling after one Muon and SGD step on the empirical loss. We set $N=100{,}000$, $\alpha=1.5$, and vary the minibatch size $B$. Each experiment is repeated 16 times. The dashed red line indicates the information-theoretic rate, and the horizontal dashed lines in Figure \ref{fig:muon-batch} correspond to the $d^{1+\frac{1}{2\alpha}}$ ceiling; the predicted critical batch sizes are given by their intersections. Observe that Muon offers a capacity gain over SGD only at sufficiently large $B$, and the empirical critical batch sizes match our predictions well.
  • Figure 4: Capacity after $T$ Muon steps on the population cross-entropy loss. We set $N=250{,}000$, $\eta=2\sqrt{d}$. Figures \ref{fig:step-2}, \ref{fig:step-3}, and \ref{fig:step-4} report the capacity at $T = 2,3,4$, respectively (see Figure \ref{fig:muon-population} for $T=1$); Figure \ref{fig:step-convergence} presents the capacity at large $T$: we run Muon for up to $500$ steps and stop early when the capacity improvement over $10$ steps drops below $0.5\%$. Figure \ref{fig:multi-step-exponent} compares the fitted dimension exponents against the predictions of Theorem \ref{thm:multi}; observe that the exponents agree except at small $\alpha$ and large $T$.
  • Figure 5: Capacity scaling of multi-step Muon and GD. We set $N=100{,}000$, $\alpha=1.5$. (a) Population update: for GD we implement an increasing learning rate schedule (see Theorem \ref{thm:gd-multi}) with $\eta_1 = 0.01\sqrt{d}$; for Muon we use a fixed step size $\eta=\sqrt{d}$. Observe that the benefit of Muon is most visible in the "early phase" of training (the initial plateau of GD over the first 3 steps is due to the small $\eta_1$ chosen for numerical stability). (b) Capacity of minibatch Muon vs. total sample size $B\times T$; for each batch size $B$, we run minibatch Muon for $T=20$ steps with $\eta=\sqrt{d}$. The dashed red line indicates the information-theoretic rate $(BT)^{1/\alpha}$.
  • ...and 5 more figures
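As a companion to the Figure 2 caption, the sketch below shows one way the two-stage fit could be carried out: first fit the slope $C_\alpha$ of $\log(\text{capacity})$ against $\log d$ for each $\alpha$, then fit the resulting exponents to $C_\alpha = c_1 + \frac{c_2}{\alpha}$. The data here is synthetic and follows the predicted Muon exponent $1 + \frac{1}{2\alpha}$; it is an illustration, not the paper's experiment.

```python
# Sketch of the two-stage exponent fit described in the Figure 2 caption, run on
# synthetic capacities that follow the predicted Muon scaling d^{1 + 1/(2 alpha)}.
import numpy as np

def fit_dimension_exponent(ds, caps):
    """Fit capacity ≈ const * d^{C_alpha} by least squares in log-log space."""
    C_alpha, _ = np.polyfit(np.log(ds), np.log(caps), 1)
    return C_alpha

def fit_exponent_law(alphas, C_alphas):
    """Fit C_alpha ≈ c1 + c2 / alpha; returns (c1, c2)."""
    c2, c1 = np.polyfit(1.0 / np.asarray(alphas), np.asarray(C_alphas), 1)
    return c1, c2

ds = np.array([64.0, 128.0, 256.0, 512.0])
alphas = [1.25, 1.5, 2.0, 3.0]
C_alphas = [fit_dimension_exponent(ds, ds ** (1 + 1 / (2 * a))) for a in alphas]
print(fit_exponent_law(alphas, C_alphas))   # ≈ (1.0, 0.5) for the Muon prediction
```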

Theorems & Definitions (63)

  • Theorem 1: Informal version of Theorems \ref{thm:main} and \ref{thm:gd}
  • Theorem 2: Informal version of Theorems \ref{thm:multi} and \ref{thm:gd-multi}
  • Remark
  • Theorem 3: one-step recovery of Muon
  • Corollary 4
  • Theorem 5: one-step recovery of SGD
  • Proposition 6
  • Lemma 8
  • Theorem 9: multi-step recovery of Muon
  • Theorem 10: multi-step recovery of SGD
  • ...and 53 more