Drop-Muon: Update Less, Converge Faster
Kaja Gruntkowska, Yassine Maziane, Zheng Qu, Peter Richtárik
TL;DR
This work challenges the conventional practice of updating all neural-network layers at every optimization step. It introduces Drop-Muon, a randomized, layer-wise, non-Euclidean optimizer that updates only a random subnetwork per iteration, coupling backpropagation-aware sampling with linear minimization oracles (LMOs) over norm balls. The paper provides convergence guarantees under two smoothness regimes, including the first progressive-training results in non-smooth stochastic settings, and a compute-cost analysis showing that full-network updates are not generally optimal. Empirically, Drop-Muon reaches the same accuracy as full-network Muon up to about $1.4\times$ faster in wall-clock time on CNNs trained on MNIST, Fashion-MNIST, and CIFAR-10, consistent with its theoretical claims. Overall, the approach offers a scalable, theoretically grounded alternative to full-network updates, with implications for training large-scale models more efficiently.
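For readers unfamiliar with the term, the LMO over a norm ball can be written as follows. This is the standard textbook definition, not a formula quoted from the paper; the symbols $G$ (a layer's gradient), $Z$, and the radius $t$ are notation chosen here for illustration. For the spectral (operator) norm, with compact SVD $G = U \Sigma V^\top$, one minimizer has a closed form:

$$
\mathrm{LMO}_{\mathcal{B}(0,t)}(G) \in \arg\min_{\|Z\| \le t} \langle G, Z \rangle,
\qquad
-t\, U V^\top \in \arg\min_{\|Z\|_{\mathrm{op}} \le t} \langle G, Z \rangle,
\qquad
\min_{\|Z\|_{\mathrm{op}} \le t} \langle G, Z \rangle = -t\, \|G\|_{*},
$$

where $\|G\|_{*}$ is the nuclear norm (the sum of the singular values of $G$).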
Abstract
Conventional wisdom in deep learning optimization dictates updating all layers at every step, a principle followed by all recent state-of-the-art optimizers such as Muon. In this work, we challenge this assumption, showing that full-network updates can be fundamentally suboptimal, both in theory and in practice. We introduce Drop-Muon, a non-Euclidean Randomized Progressive Training method: a simple yet powerful framework that updates only a subset of layers per step according to a randomized schedule, combining the efficiency of progressive training with layer-specific non-Euclidean updates for top-tier performance. We provide rigorous convergence guarantees under both layer-wise smoothness and layer-wise $(L^0, L^1)$-smoothness, covering deterministic and stochastic gradient settings. These are the first such results for progressive training in the stochastic and non-smooth regime. Our cost analysis further reveals that full-network updates are not optimal unless a very specific relationship between layer smoothness constants holds. Through controlled CNN experiments, we empirically demonstrate that Drop-Muon consistently outperforms full-network Muon, achieving the same accuracy up to $1.4\times$ faster in wall-clock time. Together, our results suggest a shift in how large-scale models can be efficiently trained, challenging the status quo and offering a highly efficient, theoretically grounded alternative to full-network updates.
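The idea of updating only a random subset of layers per step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration: the sampler draws a first layer index and updates that layer and all later ones (one plausible reading of a backpropagation-aware schedule), the spectral-norm LMO is computed with an exact SVD rather than an approximate orthogonalization, momentum is omitted, and the names `spectral_lmo`, `sample_first_layer`, `partial_step`, and `toy_grad_fn`, as well as the toy quadratic objective, are hypothetical.

```python
import numpy as np


def spectral_lmo(grad, radius):
    """Return -radius * U V^T, a minimizer of <grad, Z> over ||Z||_op <= radius."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -radius * (u @ vt)


def sample_first_layer(probs, rng):
    """Draw the index of the first layer to update (assumed sampling scheme)."""
    return int(rng.choice(len(probs), p=probs))


def partial_step(weights, grad_fn, first, radius):
    """Update layers first..last; earlier layers are left untouched."""
    # Only the gradients of the active layers are requested, mimicking a
    # backward pass that stops once it reaches the first active layer.
    grads = grad_fn(weights, range(first, len(weights)))
    updated = [w.copy() for w in weights]
    for i, g in grads.items():
        updated[i] = weights[i] + spectral_lmo(g, radius)
    return updated


def toy_grad_fn(targets):
    """Gradients of the toy objective sum_i 0.5 * ||W_i - T_i||_F^2."""
    def grad_fn(weights, indices):
        return {i: weights[i] - targets[i] for i in indices}
    return grad_fn


# Toy run: three "layers", with the first layer index drawn at random.
rng = np.random.default_rng(0)
shapes = [(4, 3), (3, 3), (3, 2)]
weights = [rng.standard_normal(s) for s in shapes]
targets = [rng.standard_normal(s) for s in shapes]
grad_fn = toy_grad_fn(targets)
probs = [0.5, 0.3, 0.2]  # assumed sampling distribution over first layers

for step in range(5):
    first = sample_first_layer(probs, rng)
    weights = partial_step(weights, grad_fn, first, radius=0.05)
    print(f"step {step}: updated layers {first}..{len(weights) - 1}")
```

In this sketch, setting `probs = [1.0, 0.0, 0.0]` makes every step update all layers, so the full-network update appears as one particular choice of sampling distribution rather than the default.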
