GradPower: Powering Gradients for Faster Language Model Pre-Training

Jinbo Wang, Mingze Wang, Jiaqi Zhang, Wei Wang, Peng Pei, Xunliang Cai, Weinan E, Lei Wu

Abstract

We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.
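The transformation described above can be sketched in a few lines. The following is a minimal, dependency-free illustration and not the authors' implementation: the helper names `sign_power` and `adam_step`, the state layout, and the numeric values are all hypothetical, and the Adam step is a textbook version with bias correction, included only to show that the base optimizer itself is left unchanged.

```python
import math


def sign_power(grad, p=1.2):
    """Elementwise sign-power transform: sign(g_i) * |g_i|**p (hypothetical helper)."""
    return [math.copysign(abs(g) ** p, g) for g in grad]


def adam_step(params, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update with bias correction (illustrative, unmodified)."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (w, g) in enumerate(zip(params, grad)):
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        new_params.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


# Toy values, chosen only for illustration.
params = [1.0, -2.0]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
grad = [0.5, -4.0]

# The only change relative to plain Adam: transform the gradient first.
params = adam_step(params, sign_power(grad, p=1.2), state)
```

In this sketch the optimizer's hyperparameters and internal logic stay as they are; only the gradient passed to it changes, which is the sense in which the abstract speaks of a single-line change.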

Paper Structure

This paper contains 29 sections, 7 theorems, 54 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Proposition 4.2

It holds that $u=\frac{1+o(1)}{1+\frac{\epsilon}{\mu^p}}$ almost surely. Letting $\tilde{u}=\frac{1}{1+\frac{\epsilon}{\mu^p}}$, we observe that $\tilde{u}$ is monotonically decreasing with respect to $p$.
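A quick numerical check may make the monotonicity claim concrete. The sketch below evaluates $\tilde{u}$ for a few powers; the values of $\mu$ and $\epsilon$ are hypothetical, and it assumes $0<\mu<1$, which is the range in which $\mu^p$ shrinks as $p$ grows. That range is an assumption made here for illustration and is not stated in this excerpt.

```python
def u_tilde(p, mu, eps):
    """Evaluate 1 / (1 + eps / mu**p), the leading-order term in Proposition 4.2."""
    return 1.0 / (1.0 + eps / mu ** p)


# Hypothetical values; mu is assumed to lie in (0, 1).
mu, eps = 0.1, 1e-3
powers = [0.8, 1.0, 1.2, 1.5]
values = [u_tilde(p, mu, eps) for p in powers]

for p, val in zip(powers, values):
    print(f"p = {p:.1f}  u_tilde = {val:.6f}")
```

With these values, a larger $p$ makes $\mu^p$ smaller, so $\epsilon/\mu^p$ grows and $\tilde{u}$ falls, matching the direction stated in the proposition.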

Figures (7)

  • Figure 1: Scaling-law comparison of AdamPower and Adam on the C4 dataset for dense LLaMA models and mixture-of-experts Qwen2MoE models.
  • Figure 2: Pre-training LLaMA (0.2B) on C4 using AdamPower with different values of the power $p$. The optimal power is $p=1.2$.
  • Figure 3: AdamPower ($p=1.2$) consistently outperforms Adam in LLaMA pre-training tasks across a range of model sizes, datasets and LR schedulers.
  • Figure 4: AdamPower ($p=1.2$) consistently outperforms Adam in Qwen2MoE pre-training tasks on C4 across varying model sizes. The learning-rate schedule is warmup-stable-decay (wsd).
  • Figure 5: (left) AdamPower with Blockwise LR outperforms both AdamPower and Adam with Blockwise LR in LLaMA pre-training. (middle, right) MuonPower (with $p=1.2$) outperforms Muon in LLaMA pre-training.
  • ...and 2 more figures

Theorems & Definitions (13)

  • Example 4.1
  • Proposition 4.2: low-noise regime, $\sigma\ll\mu$
  • Proposition 4.3: high-noise regime, $\mu\ll\sigma$
  • Theorem 4.5: Adagrad; Theorem 1 in Défossez et al. (2020)
  • Example 4.7
  • Theorem 4.8: AdagradPower, low-noise regime
  • Example 4.10
  • Theorem 4.11: AdagradPower, high-noise regime
  • Lemma C.1: Descent estimate for the update, high-noise regime
  • Proof of Lemma C.1
  • ...and 3 more