AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training

Fu-Ming Guo, Yingfang Fan

TL;DR

This work tackles late-stage over-decay in large-scale transformer pre-training under decoupled regularization by introducing AdamHD, a drop-in replacement for AdamW that replaces the traditional $\ell_2$-based decay with a decoupled Huber penalty. The method yields bounded regularization gradients and per-coordinate scale invariance, while imposing stronger sparsity pressure on overgrown weights and keeping the extra cost at $O(1)$ through a closed-form proximal update. Theoretical analysis shows that the proximal Huber step is firmly nonexpansive and bounds the decay applied per update, with limiting cases recovering both decoupled $\ell_2$ decay and no regularization. Empirically, AdamHD accelerates GPT-2/GPT-3 pre-training by $10$–$15\%$ in wall-clock time, reduces validation perplexity by up to $4$ points, improves downstream task performance by $2.5$–$4.7\%$, and yields $20$–$30\%$ memory savings after pruning, without bespoke hyperparameter sweeps. These results demonstrate a simple, robust, and practical improvement for efficient and resilient training of large foundational transformers.

Abstract

Adaptive optimizers with decoupled weight decay, such as AdamW, are the de facto standard for pre-training large transformer-based generative models. Yet the quadratic nature of the $\ell_2$ penalty embedded in weight decay drives all parameters toward the origin at the same rate, making the update vulnerable to rare but extreme gradient directions and often over-penalizing well-conditioned coordinates. We propose AdamHuberDecay (AdamHD), a drop-in replacement for AdamW that substitutes the $\ell_2$ penalty with a decoupled smooth Huber regularizer. The resulting update decays parameters quadratically while their magnitude remains below a threshold $\delta$, and linearly ($\ell_1$-like) once they exceed $\delta$, yielding (i) bounded regularization gradients, (ii) invariance to per-coordinate second-moment rescaling, and (iii) stronger sparsity pressure on overgrown weights. We derive the closed-form decoupled Huber decay step and show how to integrate it with any Adam-family optimizer at $O(1)$ extra cost. Extensive experiments on GPT-2 and GPT-3 pre-training demonstrate that AdamHuberDecay (a) converges 10–15% faster in wall-clock time, (b) reduces validation perplexity by up to 4 points, (c) delivers performance improvements of 2.5–4.7% across downstream tasks, and (d) yields visibly sparser weight histograms that translate into 20–30% memory savings after magnitude pruning, without tuning the decay coefficient beyond the default grid used for AdamW. Ablations confirm robustness to outlier gradients and large-batch regimes, and theoretical analyses bound the expected parameter norm under noisy updates. AdamHuberDecay therefore provides a simple, principled path toward more efficient and resilient training of next-generation foundational generative transformers.
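
The quadratic-then-linear decay described above can be illustrated with the textbook proximal map of the Huber penalty. This is only a sketch under stated assumptions, since the page does not reproduce the paper's closed-form step: it takes the standard penalty $h_\delta(w) = w^2/2$ for $|w| \le \delta$ and $\delta(|w| - \delta/2)$ otherwise, and uses $\tau$ as a stand-in for the product of learning rate and decay coefficient. The names `huber_prox`, `tau`, and `delta` are illustrative, and the paper's exact parameterization may differ.

```python
import numpy as np


def huber_prox(w, tau, delta):
    """Elementwise proximal map of tau * h_delta (textbook Huber penalty).

    Assumption: this is the standard closed form, not necessarily the exact
    update used in the paper. `tau` plays the role of lr * decay coefficient.
    """
    w = np.asarray(w, dtype=float)
    # Quadratic zone: multiplicative shrinkage, the proximal form of L2 decay.
    inner = w / (1.0 + tau)
    # Linear zone: constant-size step toward zero, so the decay is bounded.
    outer = w - tau * delta * np.sign(w)
    # The two branches meet continuously at |w| = delta * (1 + tau).
    return np.where(np.abs(w) <= delta * (1.0 + tau), inner, outer)


# Hypothetical usage: apply the decay to weights after an Adam-style update.
weights = np.array([0.5, -3.0, 0.0])
decayed = huber_prox(weights, tau=0.1, delta=1.0)
print(decayed)
```

Under these assumptions no coordinate moves by more than `tau * delta` in a single call. A very large `delta` reduces the map to `w / (1 + tau)`, and `delta = 0` leaves the weights unchanged, which mirrors the two limiting cases mentioned in the TL;DR.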

Paper Structure

This paper contains 18 sections, 15 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Geometrical illustration of the Huber-norm regularizer and comparison to common ones.
  • Figure 2:
  • Figure 3: Validation loss on FineWeb during pretraining