Provable Benefit of Sign Descent: A Minimal Model Under Heavy-Tailed Class Imbalance

Robin Yadav, Shuo Xie, Tianhao Wang, Zhiyuan Li

TL;DR

This work isolates a minimal, convex language-modelling task with heavy-tailed class imbalance to show a provable advantage of sign descent, a non-Euclidean steepest descent method, over normalized gradient descent. By deriving explicit smoothness constants and optimal norms for a softmax unigram objective, it demonstrates that sign descent achieves a faster rate in terms of problem-dependent complexity, especially as vocabulary size grows, and it clarifies why adaptive-smoothness explanations do not fully account for the observed gains. A variant based on the additive logistic transformation shows that the sign-descent advantage persists under closely related formulations. Overall, the paper provides a theoretical separation between non-Euclidean, coordinate-wise methods and gradient descent in a data-imbalance regime relevant to language modelling, and outlines pathways to extend these insights to stochastic and more complex settings.

Abstract

Adaptive optimization methods (such as Adam) play a major role in LLM pretraining, significantly outperforming Gradient Descent (GD). Recent studies have proposed new smoothness assumptions on the loss function to explain the advantages over GD of adaptive algorithms with structured preconditioners (e.g., coordinate-wise or layer-wise) and of steepest descent methods w.r.t. non-Euclidean norms (e.g., the $\ell_\infty$ norm or the spectral norm). However, it remains unclear how these smoothness assumptions manifest in language modelling tasks. In this work, we aim to analyze the benefit of $\ell_\infty$-norm descent (a.k.a. sign descent) directly from properties of the data distribution, namely, heavy-tailed class imbalance. We propose a minimal yet representative setting of next-token prediction, where we can provably show faster convergence of coordinate-wise algorithms such as sign descent (steepest descent w.r.t. the $\ell_\infty$ norm) over normalized GD (steepest descent w.r.t. the $\ell_2$ norm) in the presence of heavy-tailed class imbalance.
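The two update rules compared in the abstract can be illustrated on a toy instance. The sketch below assumes the objective is the cross-entropy $f(\theta) = -\sum_k p_k \log \mathrm{softmax}(\theta)_k$ with power-law targets $p_k \propto 1/k$; the function names, the dimension `d`, and the step size `eta` are illustrative choices, not taken from the paper.

```python
import math

def power_law(d):
    # Target distribution p_k proportional to 1/k (heavy-tailed class imbalance).
    weights = [1.0 / k for k in range(1, d + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def loss(theta, p):
    # Assumed objective: cross-entropy between p and softmax(theta),
    # computed with a stable log-sum-exp.
    m = max(theta)
    log_z = m + math.log(sum(math.exp(t - m) for t in theta))
    return sum(pk * (log_z - tk) for pk, tk in zip(p, theta))

def grad(theta, p):
    # Gradient of the cross-entropy: softmax(theta) - p.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z - pk for e, pk in zip(exps, p)]

def sign_direction(g):
    # Steepest descent direction w.r.t. the l_inf norm: coordinate-wise sign.
    return [(x > 0) - (x < 0) for x in g]

def l2_direction(g):
    # Steepest descent direction w.r.t. the l_2 norm: normalized gradient.
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g] if norm > 0 else [0.0] * len(g)

d = 1000      # illustrative vocabulary size
eta = 0.1     # illustrative step size
p = power_law(d)
theta = [0.0] * d
g = grad(theta, p)

# One step of each method from the same starting point.
theta_sign = [t - eta * s for t, s in zip(theta, sign_direction(g))]
theta_l2 = [t - eta * u for t, u in zip(theta, l2_direction(g))]

print("initial loss:      ", loss(theta, p))
print("after sign step:   ", loss(theta_sign, p))
print("after l2-norm step:", loss(theta_l2, p))
```

Sign descent moves every coordinate by the same amount `eta`, whereas the normalized gradient concentrates almost all of its unit $\ell_2$ budget on the few frequent classes, so rare classes barely move in a single step.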

Paper Structure

This paper contains 15 sections, 11 theorems, 48 equations, 3 figures.

Key Result

Theorem 2.1

For any minimizer $x_\star$, suppose we run normalized steepest descent with weight decay $\lambda \leq \frac{1}{\left\| x_\star \right\|}$ and learning rate $\eta_t = \frac{2}{\lambda(t+1)}$. Suppose $B = \max\{\frac{1}{\lambda}, \left\| x_0 \right\|\}$. Then the iterates $\{x_t\}_{t=1}^T$ ... In particular, if we initialize $x_0=0$ and select $\lambda$ optimally, i.e., $\lambda = 1/\min_{x \dots}$ ...
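A minimal sketch of the schedule in the theorem statement, under several assumptions that the statement does not confirm: the update takes the decoupled weight-decay form $x_{t+1} = (1 - \lambda\eta_t)\,x_t - \eta_t \Delta_t$, where $\Delta_t$ is the unit-norm steepest descent direction; the step index starts at $t = 1$ so that $\lambda\eta_t \leq 1$; the norm is $\ell_\infty$; and the objective is the softmax cross-entropy with targets $p_k \propto 1/k$. All function names are hypothetical.

```python
import math

def power_law(d):
    # Target distribution p_k proportional to 1/k.
    weights = [1.0 / k for k in range(1, d + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def loss(theta, p):
    # Assumed objective: cross-entropy between p and softmax(theta).
    m = max(theta)
    log_z = m + math.log(sum(math.exp(t - m) for t in theta))
    return sum(pk * (log_z - tk) for pk, tk in zip(p, theta))

def grad(theta, p):
    # Gradient of the cross-entropy: softmax(theta) - p.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z - pk for e, pk in zip(exps, p)]

def sign_direction(g):
    # Unit-norm steepest descent direction w.r.t. the l_inf norm.
    return [(x > 0) - (x < 0) for x in g]

def nsd_with_weight_decay(p, lam, steps, direction):
    theta = [0.0] * len(p)                     # x_0 = 0
    for t in range(1, steps + 1):              # assumed indexing: t = 1, 2, ...
        eta = 2.0 / (lam * (t + 1))            # learning rate from the theorem
        delta = direction(grad(theta, p))
        # Assumed update: decoupled weight decay plus a normalized step.
        theta = [(1.0 - lam * eta) * x - eta * dl
                 for x, dl in zip(theta, delta)]
    return theta

d = 50
p = power_law(d)
# For p_k proportional to 1/k, log p_1 - log p_d = log d, so the smallest
# l_inf norm of a minimizer of the assumed objective is log(d) / 2.
lam = 2.0 / math.log(d)
theta = nsd_with_weight_decay(p, lam, steps=500, direction=sign_direction)

entropy = -sum(pk * math.log(pk) for pk in p)  # minimum of the assumed loss
gap = loss(theta, p) - entropy
print("suboptimality gap:", gap)
print("max |theta_k|:", max(abs(x) for x in theta), " 1/lambda:", 1.0 / lam)
```

Under this assumed form, each update is a convex combination of the current iterate and a point of norm $1/\lambda$, so the iterates stay inside the norm ball of radius $1/\lambda$. This is why $\lambda$ needs to be small enough for that ball to contain a minimizer.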

Figures (3)

  • Figure 1: GD and NormGD struggle to optimize a simple softmax unigram model with heavy-tailed class imbalance. This result holds both on a real-world dataset and on synthetically generated data following a power-law distribution $p_k \propto \frac{1}{k}$.
  • Figure 2: Performance of NSD with weight decay when minimizing $f$ with $d = 10^3$. For each optimizer, we set $\lambda = \frac{1}{\min_{\theta_\star \in \mathop{\mathrm{arg\,min}}\limits f }{\left\| \theta_\star \right\|}}$ and use a learning rate of $\eta_t = \frac{2}{\lambda(t+1)}$.
  • Figure 3: Performance of NSD with weight decay when minimizing $f$ with $d = 10^3$. For each optimizer, we set $\lambda = \frac{1}{\min_{\theta_\star \in \mathop{\mathrm{arg\,min}}\limits f }{\left\| \theta_\star \right\|}}$ and use a learning rate of $\eta_t = \frac{2}{\lambda(t+1)}$.
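The weight-decay choice $\lambda = 1/\min_{\theta_\star} \|\theta_\star\|$ used in the captions can be worked out in closed form for a toy case. The sketch below assumes the objective is the softmax cross-entropy $f(\theta) = -\sum_k p_k \log \mathrm{softmax}(\theta)_k$ with $p_k \propto 1/k$, whose minimizers are $\theta_\star = \log p + c\,\mathbf{1}$ for any constant $c$; the helper names are hypothetical.

```python
import math

def power_law(d):
    # Target distribution p_k proportional to 1/k.
    weights = [1.0 / k for k in range(1, d + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def min_norm_minimizers(p):
    # Minimizers of the assumed objective are log p + c * 1.
    # The l_2 norm is minimized by centering at the mean of log p,
    # the l_inf norm by centering at the midrange of log p.
    log_p = [math.log(x) for x in p]
    mean = sum(log_p) / len(log_p)
    midrange = (max(log_p) + min(log_p)) / 2.0
    theta_l2 = [x - mean for x in log_p]
    theta_linf = [x - midrange for x in log_p]
    return theta_l2, theta_linf

d = 10**3
p = power_law(d)
theta_l2, theta_linf = min_norm_minimizers(p)

norm_l2 = math.sqrt(sum(x * x for x in theta_l2))
norm_linf = max(abs(x) for x in theta_linf)

lam_l2 = 1.0 / norm_l2      # weight decay for normalized GD (l_2 geometry)
lam_linf = 1.0 / norm_linf  # weight decay for sign descent (l_inf geometry)

print("min l_2 norm:  ", norm_l2, " lambda:", lam_l2)
print("min l_inf norm:", norm_linf, " lambda:", lam_linf)
```

For this distribution the smallest $\ell_\infty$ norm equals $\log(d)/2$, while the smallest $\ell_2$ norm sums squared deviations over all $d$ coordinates, so the two values of $\lambda$ differ substantially at $d = 10^3$.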

Theorems & Definitions (12)

  • Theorem 2.1
  • Lemma 3.1
  • Lemma 3.2
  • Lemma 3.3
  • Theorem 3.1
  • Corollary 3.1
  • Definition 4.1: Adaptive Smoothness
  • Lemma 4.1
  • Lemma B.1
  • Lemma B.2
  • ...and 2 more