Stable Language Model Pre-training by Reducing Embedding Variability

Woojin Chung; Jiwoo Hong; Na Min An; James Thorne; Se-Young Yun

TL;DR

The paper shows, both theoretically and empirically, that Multi-head Low-Rank Attention (MLRA) reduces pre-training instability. Experiments on GPT-2 variants show improved stability and lower perplexity, even as the number of layers grows.

Abstract

Stable pre-training is essential for achieving better-performing language models. However, tracking pre-training stability by calculating gradient variance at every step is impractical because of its significant computational cost. We explore Token Embedding Variability (TEV) as a simple and efficient proxy for assessing pre-training stability in language models with pre-layer normalization, given that shallower layers are more prone to gradient explosion (Section 2.2). Moreover, we propose Multi-head Low-Rank Attention (MLRA) as an architecture that alleviates such instability by limiting the exponential growth of output embedding variance, thereby preventing gradient explosion (Section 3.2). Empirical results on GPT-2 with MLRA demonstrate increased stability and lower perplexity, particularly in deeper models.
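As a rough illustration of why TEV is cheap to track, the sketch below computes it from the embedding table alone, with no gradients involved. It assumes that the TEV of a token is the standard deviation of its embedding vector across the model dimensions, and that $\mu_{\text{TEV}}$ and $\sigma_{\text{TEV}}$ are the mean and standard deviation of TEV over the vocabulary. That definition is not stated in this summary, so treat it as an assumption. The function name and the toy data are hypothetical.

```python
import random
from statistics import fmean, pstdev


def token_embedding_variability(embedding):
    """Return (per-token TEV, mu_TEV, sigma_TEV) for a |V| x d_model table.

    Assumption: the TEV of a token is the population standard deviation of
    its embedding vector across the d_model dimensions.
    """
    tev = [pstdev(row) for row in embedding]
    return tev, fmean(tev), pstdev(tev)


# Hypothetical toy embedding table: 100 tokens, 16 dimensions.
rng = random.Random(0)
vocab_size, d_model = 100, 16
calm = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]

# Same table with the first five rows inflated, mimicking a few tokens
# whose embeddings drifted during unstable updates (illustrative only).
noisy = [[x * (10.0 if i < 5 else 1.0) for x in row] for i, row in enumerate(calm)]

_, mu_calm, sigma_calm = token_embedding_variability(calm)
_, mu_noisy, sigma_noisy = token_embedding_variability(noisy)

print(f"calm : mu_TEV={mu_calm:.4f} sigma_TEV={sigma_calm:.4f}")
print(f"noisy: mu_TEV={mu_noisy:.4f} sigma_TEV={sigma_noisy:.4f}")
```

Under these assumptions, both statistics rise for the inflated table. The computation is a single pass over the embedding matrix, which is what would make such a proxy inexpensive next to per-step gradient variance.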

Paper Structure (28 sections, 22 equations, 4 figures, 1 table)

Introduction
Pre-training Stability Proxy
Preliminaries
Stability and Token Embedding Layer
Gradient explosion
Skewness in token frequency
Token Embedding Variability (TEV)
Mitigating TEV with Factorization
Multi-head Low Rank Attention (MLRA)
Theoretical Analysis
Experiments
Experimental Design
Baseline
Datasets
Results
...and 13 more sections

Figures (4)

Figure 1: TEV distributions for OPT, Pythia, Llama-2, and GPT-2 show that both $\mu_{\text{TEV}}$ and $\sigma_{\text{TEV}}$ decrease as model size grows. This trend correlates with better model performance, as less noisy gradients lead to higher pre-training stability. For a fair comparison, Pythia 6.9B and 12B were excluded because their vocabulary sizes differ.
Figure 2: Gradient variance ($\downarrow$) comparison across the tested models at different layer counts. MLRA shows lower gradient variance than both GPT-2 and $\sigma$Reparam. GPT-2 with 192 layers was excluded because training failed 5 times (i.e., the gradient variance is infinite in the earlier steps and becomes infinitesimal in the later steps).
Figure 3: $\mu_{\text{TEV}}$ (top) and gradient variance (bottom) during the pre-training of GPT-2 and MLRA, each with 48 layers, over 1 billion tokens. In both settings, $\mu_{\text{TEV}}$ and gradient variance follow the same trend throughout pre-training.
Figure 4: Row-wise average of the absolute mean value of the $|V|$ token embeddings in the token embedding layer $\mathbf{E} \in \mathbb{R}^{|V| \times d_{\text{model}}}$ across OPT (Zhang et al., 2022), Pythia (Biderman et al., 2023), Llama-2 (Touvron et al., 2023), and GPT-2 (Radford et al., 2019). $\mathbf{E}$ in each pre-trained checkpoint remains centered around zero.
