Stable Language Model Pre-training by Reducing Embedding Variability

Woojin Chung, Jiwoo Hong, Na Min An, James Thorne, Se-Young Yun

TL;DR

Multi-head Low-Rank Attention (MLRA) is shown, both theoretically and empirically, to reduce pre-training instability: GPT-2 variants with MLRA train more stably and reach lower perplexity, even at deeper layer counts.

Abstract

Stable pre-training is essential for achieving better-performing language models. However, tracking pre-training stability by calculating gradient variance at every step is impractical due to the significant computational costs. We explore Token Embedding Variability (TEV) as a simple and efficient proxy for assessing pre-training stability in language models with pre-layer normalization, given that shallower layers are more prone to gradient explosion (section 2.2). Moreover, we propose Multi-head Low-Rank Attention (MLRA) as an architecture to alleviate such instability by limiting the exponential growth of output embedding variance, thereby preventing the gradient explosion (section 3.2). Empirical results on GPT-2 with MLRA demonstrate increased stability and lower perplexity, particularly in deeper models.

Paper Structure (28 sections, 22 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: TEV distributions for OPT, Pythia, Llama-2, and GPT-2 show that both $\mu_{\text{TEV}}$ and $\sigma_{\text{TEV}}$ decrease as model size grows. This trend correlates with better model performance, since less noisy gradients lead to higher pre-training stability. For a fair comparison, Pythia 6.9B and 12B were excluded because their vocabulary sizes differ.
  • Figure 2: Gradient variance ($\downarrow$) comparison across the tested models at different layer counts. MLRA shows the lowest gradient variance, below both GPT-2 and $\sigma$Reparam. GPT-2 with 192 layers was excluded because training failed 5 times (i.e., the gradient variance is infinite at the earlier steps and becomes infinitesimal in the later steps).
  • Figure 3: $\mu_{\text{TEV}}$ (top) and gradient variance (bottom) during the pre-training of GPT-2 and MLRA, each with 48 layers, over the course of 1 billion tokens. In both settings, $\mu_{\text{TEV}}$ and gradient variance follow the same trend throughout pre-training.
  • Figure 4: Row-wise average of the absolute mean value of the $|V|$ token embeddings in the token embedding layer $\mathbf{E} \in \mathbb{R}^{|V| \times d_{\text{model}}}$ across OPT [zhang2022opt], Pythia [biderman2023pythia], Llama-2 [touvron2023llama2], and GPT-2 [radford2019language]. In the pre-trained checkpoints, $\mathbf{E}$ remains centered around zero.
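
The quantities named in the captions above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the TEV of a token is the standard deviation of its embedding row across the $d_{\text{model}}$ dimensions, and that $\mu_{\text{TEV}}$ and $\sigma_{\text{TEV}}$ are the mean and standard deviation of those values over the vocabulary. The excerpt does not state this definition, and the function names and the toy matrix `E` are hypothetical.

```python
import math
import random


def token_embedding_variability(embedding):
    """Per-token TEV, ASSUMED to be the std of each embedding row."""
    tev = []
    for row in embedding:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        tev.append(math.sqrt(var))
    return tev


def summarize(values):
    """Mean and population standard deviation of a list of values."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, sigma


def mean_abs_row_mean(embedding):
    """Average over rows of |row mean|; one reading of the Figure 4 statistic."""
    return sum(abs(sum(row) / len(row)) for row in embedding) / len(embedding)


# Toy stand-in for an embedding matrix E with |V| = 100 and d_model = 64.
# A real check would load E from a pre-trained checkpoint instead.
random.seed(0)
E = [[random.gauss(0.0, 0.02) for _ in range(64)] for _ in range(100)]

mu_tev, sigma_tev = summarize(token_embedding_variability(E))
centering = mean_abs_row_mean(E)
print(f"mu_TEV={mu_tev:.4f}  sigma_TEV={sigma_tev:.4f}  mean|row mean|={centering:.4f}")
```

Under these assumptions the cost is a single pass over $\mathbf{E}$, which is in line with the abstract's description of TEV as a simple and efficient proxy compared with computing gradient variance at every step.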