On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

Taejong Joo; Wenhan Xia; Cheolmin Kim; Ming Zhang; Eugene Ie

TL;DR

The paper questions the reliance on dense adaptive optimizers in large language model (LLM) training and proposes structured stochastic update masking instead. It introduces SkipUpdate, a block-wise Bernoulli masking scheme that keeps updates unbiased and induces a curvature-dependent regularizer in the expected loss, and Magma, a momentum-aligned masking wrapper that modulates each block's masked update by a momentum-gradient alignment score $s_t^{(b)} = \mathrm{sigmoid}\big(\mathrm{cossim}(\mu_t^{(b)}, g_t^{(b)})/\tau\big)$, where $\mu_t^{(b)}$ and $g_t^{(b)}$ are the momentum and gradient of block $b$ at step $t$ and $\tau$ is a temperature. A theoretical descent analysis shows that masking adds a curvature-weighted penalty, and experiments show consistent gains for Magma on Llama 2 pre-training on C4, Nano MoE pre-training, and controlled benchmarks with heavy-tailed noise and heterogeneous Hessians. At the 1B-parameter scale, Magma reduces perplexity by over 19% and 9% relative to Adam and Muon, respectively, with RMSProp+Magma achieving the best results. In practice, Magma is a drop-in wrapper with negligible overhead that scales favorably with model size, which points to structured stochasticity as a way to stabilize training and improve generalization in ill-conditioned transformer landscapes.

Abstract

Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
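One way to see where a curvature-dependent term can come from is a second-order Taylor expansion under an assumed masking model. The setup below is an illustrative assumption, not the paper's exact statement: $\Delta$ is the dense update with blocks $\Delta^{(b)}$, each block is kept independently with probability $p$ and rescaled by $1/p$, $H$ is the loss Hessian at $\theta$, and $H^{(bb)}$ is its diagonal block for block $b$.

```latex
% Assumed masked update: m^{(b)} ~ Bernoulli(p), independent across blocks
\tilde{\Delta}^{(b)} = \frac{m^{(b)}}{p}\,\Delta^{(b)},
\qquad
\mathbb{E}\big[\tilde{\Delta}\big] = \Delta,
\qquad
\mathbb{E}\Big[\tfrac{m^{(b)} m^{(b')}}{p^{2}}\Big] =
\begin{cases}
1/p, & b = b', \\
1,   & b \neq b'.
\end{cases}

% Second-order expansion of the expected loss after one masked step
\mathbb{E}\big[L(\theta - \tilde{\Delta})\big]
\approx
L(\theta) - \nabla L(\theta)^{\top}\Delta
+ \tfrac{1}{2}\,\Delta^{\top} H\,\Delta
+ \frac{1-p}{2p}\sum_{b} \Delta^{(b)\top} H^{(bb)}\,\Delta^{(b)}.
```

The first three terms match the dense update; the last term is the extra cost of masking. It grows as $p$ shrinks and weights each block's update by that block's own curvature, so under these assumptions masking penalizes large steps along high-curvature blocks.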

Paper Structure (29 sections, 4 theorems, 28 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Update Masking as a Regularization
Momentum-Aligned Update Masking
Experiments
Pre-Training Llama
Pre-Training Nano MoE
Magma under Heavy-Tailed Gradient Noises
Magma on Heterogeneous Quadratics
Discussion
Literature Review
Conclusion
Proofs of Claims
Proof of Proposition \ref{lemma:implicit_reg}
Proof of Lemma \ref{prop:magma_descent}
Proof of Lemma \ref{sup_lem:descent_lower_bound}
...and 14 more sections

Key Result

Proposition 1

Conditioned on ${\mathcal{F}}_t$, the expected loss of SkipUpdate (cf. equation eq:rsu_def) is

Figures (7)

Figure 1: Pre-training performance on C4 across model scales. Despite discarding half of updates, SkipUpdate yields substantial improvements over state-of-the-art dense optimizers.
Figure 2: Optimization trajectories of pre-training the Nano MoE model on OpenWebText.
Figure 3: Magma on light-tailed and heavy-tailed data distributions. Top: Optimization trajectories for Adam and Magma. Bottom: Robust condition number, defined as the ratio between the maximum and median eigenvalues of the loss Hessian.
Figure 4: Magma on homogeneous and heterogeneous quadratics. Top: Optimization trajectories for AdamW and Magma on quadratic objectives with identical eigenspectra but different block structures. Bottom: Average gradient-momentum alignment per block.
Figure A1: Comparison of eval perplexity for different values of sampling ratio $p$ and damping temperature $\tau$.
...and 2 more figures

Theorems & Definitions (8)

Proposition 1
Lemma 4
Lemma 5
Theorem 6
proof
proof
proof
proof
