Provable Benefit of Sign Descent: A Minimal Model Under Heavy-Tailed Class Imbalance
Robin Yadav, Shuo Xie, Tianhao Wang, Zhiyuan Li
TL;DR
This work isolates a minimal, convex language-modeling task with heavy-tailed class imbalance to show a provable advantage of sign-based, non-Euclidean descent over normalized gradient descent. By deriving explicit smoothness constants and optimal norms for a softmax unigram objective, it demonstrates that sign descent achieves a faster rate in terms of problem-dependent complexity, especially as vocabulary size grows, while also clarifying why adaptive-smoothness explanations do not fully account for the observed gains. A variant based on the additive logistic transformation shows that the sign-descent advantage persists under closely related formulations. Overall, the paper provides a theoretical separation between non-Euclidean, coordinate-wise methods and gradient descent in a data-imbalance regime relevant to language modeling, and outlines pathways to extend these insights to stochastic and more complex settings.
Abstract
Adaptive optimization methods (such as Adam) play a major role in LLM pretraining, significantly outperforming Gradient Descent (GD). Recent studies have proposed new smoothness assumptions on the loss function to explain the advantages over GD of adaptive algorithms with structured preconditioners (e.g., coordinate-wise or layer-wise) and of steepest descent methods w.r.t. non-Euclidean norms (e.g., the $\ell_\infty$ norm or the spectral norm). However, it remains unclear how these smoothness assumptions manifest in language modeling tasks. In this work, we aim to analyze the benefit of $\ell_\infty$-norm descent (a.k.a. sign descent) directly from a property of the data distribution, namely, heavy-tailed class imbalance. We propose a minimal yet representative setting of next-token prediction in which we can provably show faster convergence of coordinate-wise algorithms such as sign descent (steepest descent w.r.t. the $\ell_\infty$ norm) over normalized GD (steepest descent w.r.t. the $\ell_2$ norm) in the presence of heavy-tailed class imbalance.
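To make the two update rules compared above concrete, the following is a minimal sketch, not the authors' code. It assumes a softmax unigram objective (the cross-entropy between a fixed token distribution $p$ and $\mathrm{softmax}(\theta)$) and a Zipf-like power law as one possible instance of heavy-tailed class imbalance. All function names, the exponent, and the step size are illustrative assumptions.

```python
import numpy as np

def zipf_distribution(vocab_size, exponent=1.0):
    # Assumed heavy-tailed class frequencies: p_i proportional to i^(-exponent).
    ranks = np.arange(1, vocab_size + 1, dtype=np.float64)
    weights = ranks ** (-exponent)
    return weights / weights.sum()

def softmax(theta):
    # Shift by the max for numerical stability.
    e = np.exp(theta - theta.max())
    return e / e.sum()

def unigram_loss(theta, p):
    # Cross-entropy between p and softmax(theta).
    z = theta - theta.max()
    log_q = z - np.log(np.exp(z).sum())
    return -(p * log_q).sum()

def unigram_grad(theta, p):
    # Gradient of the cross-entropy w.r.t. the logits.
    return softmax(theta) - p

def sign_descent_step(theta, grad, lr):
    # Normalized steepest descent w.r.t. the l_inf norm:
    # every coordinate with a nonzero gradient moves by exactly lr.
    return theta - lr * np.sign(grad)

def normalized_gd_step(theta, grad, lr):
    # Normalized steepest descent w.r.t. the l_2 norm:
    # the whole update vector has Euclidean length lr.
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return theta.copy()
    return theta - lr * grad / norm

if __name__ == "__main__":
    p = zipf_distribution(1000)
    theta_sign = np.zeros_like(p)
    theta_ngd = np.zeros_like(p)
    for _ in range(200):
        theta_sign = sign_descent_step(theta_sign, unigram_grad(theta_sign, p), 0.05)
        theta_ngd = normalized_gd_step(theta_ngd, unigram_grad(theta_ngd, p), 0.05)
    print("sign descent loss: ", unigram_loss(theta_sign, p))
    print("normalized GD loss:", unigram_loss(theta_ngd, p))
```

The two steps differ only in the geometry of the update: sign descent moves each coordinate by the same amount regardless of gradient magnitude, so rare tokens with small gradients are updated as aggressively as frequent ones, while normalized GD concentrates its fixed Euclidean budget on the largest gradient entries. The fixed step size is a simplification, and which method reaches a lower loss in this toy loop depends on the step size and horizon chosen.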
