Drop-Muon: Update Less, Converge Faster

Kaja Gruntkowska, Yassine Maziane, Zheng Qu, Peter Richtárik

TL;DR

This work challenges the conventional practice of updating all neural-network layers at every optimization step. It introduces Drop-Muon, a randomized, layer-wise, non-Euclidean optimizer that updates only a random subnetwork per iteration, coupling backpropagation-aware sampling with LMOs over norm balls. The paper provides convergence guarantees under two smoothness regimes, including the first progressive-training results in non-smooth stochastic settings, and a compute-cost analysis showing full-network updates are not generally optimal. Empirically, Drop-Muon yields consistent wall-clock speedups (up to about 1.4x) over full-network Muon on CNNs trained on MNIST, Fashion-MNIST, and CIFAR-10, validating its practical benefit and theoretical claims. Overall, the approach offers a scalable, theoretically grounded alternative to full-network updates with immediate implications for training large-scale models more efficiently.

Abstract

Conventional wisdom in deep learning optimization dictates updating all layers at every step, a principle followed by all recent state-of-the-art optimizers such as Muon. In this work, we challenge this assumption, showing that full-network updates can be fundamentally suboptimal, both in theory and in practice. We introduce Drop-Muon, a non-Euclidean Randomized Progressive Training method: a simple yet powerful framework that updates only a subset of layers per step according to a randomized schedule, combining the efficiency of progressive training with layer-specific non-Euclidean updates for top-tier performance. We provide rigorous convergence guarantees under both layer-wise smoothness and layer-wise $(L^0, L^1)$-smoothness, covering deterministic and stochastic gradient settings, marking the first such results for progressive training in the stochastic and non-smooth regime. Our cost analysis further reveals that full-network updates are not optimal unless a very specific relationship between layer smoothness constants holds. Through controlled CNN experiments, we empirically demonstrate that Drop-Muon consistently outperforms full-network Muon, achieving the same accuracy up to $1.4\times$ faster in wall-clock time. Together, our results suggest a shift in how large-scale models can be efficiently trained, challenging the status quo and offering a highly efficient, theoretically grounded alternative to full-network updates.
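
To make the update rule concrete, the following is a minimal sketch of one randomized layer-wise step, not the authors' implementation. It assumes that a starting layer s is sampled with probability p_s and that layers s through b are then updated (the index sets $\{s,\dots,b\}$ in Theorem 4.1 suggest this), that the norm ball is the spectral-norm ball, so that the LMO direction is the polar factor of the gradient, and that this factor is approximated with the classical cubic Newton-Schulz iteration. Momentum is omitted, indices are 0-based, and `grad_fn` is a hypothetical callback that returns gradients for the active layers only.

```python
import random

def matmul(A, B):
    # Plain list-of-lists matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthogonalize(G, steps=30):
    """Approximate the polar factor U V^T of G (cubic Newton-Schulz iteration)."""
    fro = sum(g * g for row in G for g in row) ** 0.5
    if fro == 0.0:
        return [row[:] for row in G]
    # After Frobenius normalization all singular values are at most 1,
    # which lies inside the convergence region of the iteration.
    X = [[g / fro for g in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

def drop_muon_like_step(layers, grad_fn, probs, stepsizes, rng):
    """One step: sample a start layer s, update layers s..b-1, leave the rest untouched."""
    s = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    active = range(s, len(layers))
    grads = grad_fn(layers, active)  # hypothetical: {i: gradient matrix} for active layers
    for i in active:
        D = orthogonalize(grads[i])
        layers[i] = [[x - stepsizes[i] * d for x, d in zip(rx, rd)]
                     for rx, rd in zip(layers[i], D)]
    return s
```

Because only layers s through b are touched, backpropagation can stop at layer s, which is where the per-step compute saving would come from under these assumptions.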

Paper Structure

This paper contains 54 sections, 20 theorems, 271 equations, 9 figures, 3 algorithms.

Key Result

Theorem 4.1

Let the lower-bound and layer-wise smoothness assumptions (as:lower_bound and as:arbitrary_layer_smoothness) hold, and let $\{X^k\}_{k=0}^{K-1}$ be the iterates of the algorithm alg:rt_arbitrary run with stepsizes $\gamma_i^k = 1/L_{i,S^k}^0$. Then [...], where $w_i := \sum_{s=1}^i \frac{p_s}{2 L_{i, \{s,\dots,b\}}^0}$.
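
The weights $w_i$ can be evaluated directly from the sampling probabilities and the smoothness constants. The sketch below is illustrative only: it uses 0-based indices, and `L0` is a hypothetical callable standing in for $L_{i,\{s,\dots,b\}}^0$, since the extract does not say how these constants are obtained.

```python
def layer_weights(p, L0):
    """Compute w_i = sum_{s <= i} p[s] / (2 * L0(i, s)) for every layer i.

    p[s]     : probability that the update starts at layer s
               (layers s, ..., b-1 are then updated).
    L0(i, s) : hypothetical callable returning the smoothness constant of
               layer i when layers s, ..., b-1 are the active set.
    """
    b = len(p)
    return [sum(p[s] / (2.0 * L0(i, s)) for s in range(i + 1)) for i in range(b)]
```

With equal constants, layers closer to the output receive larger weights whenever later start indices have positive probability, because they belong to more of the sampled sets; putting all probability on the first layer recovers equal weights, which corresponds to full-network updates.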

Figures (9)

  • Figure 1: Evolution of the training accuracy for Muon and Drop-Muon with uniform index sampling on MNIST. Batch size $=8192$, learning rate $=0.1$, channels $=[64,128,256]$.
  • Figure 2: Averaged time-to-target speed-up over multiple runs comparing Muon and Drop-Muon with epoch-shift index sampling. Left: MNIST with batch size $=8192$, learning rate $=0.1$, and channels $=[64,128,256]$. Right: Fashion-MNIST with batch size $=32768$, learning rate $=0.1$, and channels $=[64,128,256]$.
  • Figure 3: Evolution of the training accuracy for Muon and Drop-Muon with epoch-shift index sampling on CIFAR-10. Batch size $=8192$, learning rate $=0.1$, channels $=[128,256,512]$.
  • Figure 4: Evolution of the layer sampling distribution over the epochs. Shallow layers are trained more often in the first epochs, but their probability of being sampled decreases as training proceeds. This effect can be amplified or reduced by varying the value of $\alpha$; here we chose $\alpha=0.5$.
  • Figure 5: Normalized curve averaging of several runs of Muon and Drop-Muon with uniform index sampling on MNIST. Batch size $=8192$, learning rate $=0.1$, channels $=[64,128,256]$.
  • ...and 4 more figures
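
The time-to-target speed-up named in the Figure 2 caption can be read as a ratio of wall-clock times at which each method first reaches a given accuracy. The sketch below assumes that reading; the exact definition and averaging procedure are not given in this extract, and the function names are hypothetical.

```python
def time_to_target(times, accs, target):
    """Return the first recorded time at which accuracy reaches the target, else None."""
    for t, a in zip(times, accs):
        if a >= target:
            return t
    return None

def speedup(baseline, candidate, target):
    """Ratio baseline_time / candidate_time at the target accuracy.

    baseline, candidate: (times, accuracies) pairs of equal-length sequences.
    Returns None if either run never reaches the target.
    """
    tb = time_to_target(baseline[0], baseline[1], target)
    tc = time_to_target(candidate[0], candidate[1], target)
    if tb is None or tc is None:
        return None
    return tb / tc
```

A value above 1 means the candidate run reached the target sooner than the baseline.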

Theorems & Definitions (46)

  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Lemma B.1
  • proof
  • Remark B.1
  • Remark B.2
  • Theorem D.1
  • Remark D.2
  • ...and 36 more