Muon Optimizer Accelerates Grokking

Amund Tveit, Bjørn Remseth, Arve Skogvold

TL;DR

This work investigates how the choice of optimizer affects grokking, the delayed generalization observed in overparameterized models, by systematically comparing Muon to AdamW across seven numerical tasks (primarily modular arithmetic) using a Transformer with RoPE and RMSNorm. Crossing the two optimizers with three softmax variants, the study provides empirical evidence that Muon speeds up the onset of grokking: the mean grokking epoch falls from $153.09$ with AdamW to $102.89$ with Muon, a statistically significant difference ($t\approx 5.0175$, $p\approx 6.33\times 10^{-8}$). The results suggest that optimizer update geometry, including spectral-norm constraints and near-second-order updates, plays an important role in how quickly a model moves from memorization to generalization. These findings have practical implications for training efficiency and generalization in algorithmic tasks, and they motivate further experiments on larger models and more diverse domains.
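To make the phrase "spectral-norm constraints" concrete, the following is a minimal NumPy sketch of a Muon-style update for a single weight matrix. It is an illustration under stated assumptions, not the implementation used in the paper: the function names, the learning rate, the momentum coefficient, and the step count are all hypothetical, and the classical cubic Newton-Schulz iteration is swapped in for the tuned polynomial that published Muon implementations typically use.

```python
import numpy as np


def orthogonalize(update: np.ndarray, steps: int = 30) -> np.ndarray:
    """Approximate the nearest semi-orthogonal matrix to `update`.

    Uses the classical cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X, which drives every nonzero singular
    value toward 1 (an assumption: not the paper's exact variant).
    """
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # which lies inside the iteration's convergence region.
    x = update / (np.linalg.norm(update) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


def muon_like_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One hypothetical Muon-style step for a 2-D weight matrix.

    `lr` and `beta` are illustrative defaults, not values from the paper.
    """
    momentum_buf = beta * momentum_buf + grad        # momentum accumulation
    direction = orthogonalize(momentum_buf)          # spectral norm close to 1
    return weight - lr * direction, momentum_buf
```

Because the orthogonalized direction has all singular values near 1, the spectral norm of each update is roughly `lr` regardless of the raw gradient scale, which is one way to read the "update geometry" argument above.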

Abstract

This paper investigates the impact of different optimizers on the grokking phenomenon, where models exhibit delayed generalization. We conducted experiments across seven numerical tasks (primarily modular arithmetic) using a modern Transformer architecture. The experimental configuration systematically varied the optimizer (Muon vs. AdamW) and the softmax activation function (standard softmax, stablemax, and sparsemax) to assess their combined effect on learning dynamics. Our empirical evaluation reveals that the Muon optimizer, characterized by its use of spectral norm constraints and second-order information, significantly accelerates the onset of grokking compared to the widely used AdamW optimizer. Specifically, the mean grokking epoch across all configurations was 102.89 with Muon versus 153.09 with AdamW, a statistically significant difference (t = 5.0175, p = 6.33e-08). This suggests that the optimizer choice plays a crucial role in facilitating the transition from memorization to generalization.
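For readers unfamiliar with the three output functions named above, here is a small NumPy sketch of each, applied to a single 1-D logit vector. The definitions are assumptions, since this excerpt does not spell them out: sparsemax is taken to be the Euclidean projection onto the probability simplex, and stablemax is taken to normalize the piecewise map s(x) = x + 1 for x >= 0 and s(x) = 1 / (1 - x) for x < 0. The paper's exact variants may differ.

```python
import numpy as np


def softmax(z):
    """Standard softmax, shifted by the max logit for numerical stability."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def stablemax(z):
    """Assumed stablemax: normalize s(x) instead of exp(x).

    s(x) = x + 1 for x >= 0, and 1 / (1 - x) for x < 0.
    """
    z = np.asarray(z, dtype=float)
    # np.minimum keeps the unused branch finite for positive inputs.
    s = np.where(z >= 0, z + 1.0, 1.0 / (1.0 - np.minimum(z, 0.0)))
    return s / s.sum()


def sparsemax(z):
    """Assumed sparsemax: Euclidean projection onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # logits in descending order
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    in_support = 1 + k * z_sorted > cumsum      # entries that stay nonzero
    k_max = k[in_support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max       # threshold subtracted from z
    return np.maximum(z - tau, 0.0)
```

All three return non-negative vectors that sum to 1. Unlike the other two, sparsemax can assign exactly zero probability to low-scoring classes.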

Paper Structure

This paper contains 8 sections, 5 figures.

Figures (5)

  • Figure 1: Key mechanisms by which the Muon optimizer may accelerate grokking compared to AdamW.
  • Figure 2: Overview of datasets used in the experiments.
  • Figure 3: Softmax variants compared in the experiments.
  • Figure 4: Distribution of grokking epochs for Muon and AdamW optimizers across all tasks and softmax configurations. The boxplot shows medians, quartiles, and potential outliers, indicating Muon tends to grok earlier.
  • Figure 5: Mean number of epochs required to reach the grokking threshold ($\ge$95% validation accuracy) for each optimizer, averaged across all experimental conditions.
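The grokking threshold in the Figure 5 caption can be turned into a simple per-run measurement. The sketch below is one plausible reading, not the authors' code: the helper name and the 1-indexed epoch convention are assumptions, and how the paper treats runs that never reach the threshold is not stated in this excerpt.

```python
from typing import Optional, Sequence

# From the Figure 5 caption: grokking means >= 95% validation accuracy.
GROK_THRESHOLD = 0.95


def grokking_epoch(
    val_acc: Sequence[float], threshold: float = GROK_THRESHOLD
) -> Optional[int]:
    """Return the first epoch (1-indexed, an assumed convention) at which
    validation accuracy reaches `threshold`, or None if it never does."""
    for epoch, acc in enumerate(val_acc, start=1):
        if acc >= threshold:
            return epoch
    return None
```

Applying this helper to the validation curve of each run and averaging per optimizer would give a statistic of the kind reported in Figure 5.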