Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes

Kosuke Nishida, Kyosuke Nishida, Kuniko Saito

TL;DR

Weight scaling as reparameterization (WeSaR) is a novel technique that introduces one gate parameter per parameter matrix and adjusts it to the value that keeps the gradient scale constant across layers, which results in stable training.

Abstract

Loss spikes, a phenomenon in which the loss value diverges suddenly, are a fundamental issue in the pre-training of large language models. This paper hypothesizes that the non-uniformity of the norm of the parameters is one of the causes of loss spikes. In training neural networks, the scale of the gradients must be kept constant throughout the layers to avoid the vanishing and exploding gradients problem. However, to meet this requirement in the Transformer model, the norm of the model parameters must be non-uniform, and thus parameters whose norm is smaller are more sensitive to the parameter update. To address this issue, we propose a novel technique, weight scaling as reparameterization (WeSaR). WeSaR introduces a gate parameter per parameter matrix and adjusts it to the value satisfying the requirement. Because of the gate parameter, WeSaR can set the norm of the original parameters uniformly, which results in stable training. Experimental results with Transformer decoders consisting of 130 million, 1.3 billion, and 13 billion parameters showed that WeSaR stabilizes and accelerates training and that it outperforms the compared methods, including popular initialization methods.
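
To make the gate idea in the abstract concrete, the following is a minimal sketch of the initialization only, not the authors' implementation. It assumes that the effective weight is a scalar gate times a stored matrix, that every stored matrix is drawn with one shared standard deviation, and that the gate is set to the ratio of a desired scale to that shared scale. The names `init_with_gate`, `common_std`, and `target_std`, and the numeric values, are hypothetical.

```python
import math
import random


def init_with_gate(rows, cols, target_std, common_std=0.02, seed=0):
    """Draw stored weights with a shared std and return (gate, stored).

    Assumption: the effective weight is gate * stored, so choosing
    gate = target_std / common_std gives the effective weight a std of
    roughly target_std while every stored matrix shares common_std.
    """
    rng = random.Random(seed)
    stored = [[rng.gauss(0.0, common_std) for _ in range(cols)]
              for _ in range(rows)]
    gate = target_std / common_std
    return gate, stored


def effective_weight(gate, stored):
    """Scale the stored matrix by its scalar gate."""
    return [[gate * w for w in row] for row in stored]


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(w * w for row in matrix for w in row))


# Two same-shaped matrices whose hypothetical target scales differ by 10x.
gate_u, stored_u = init_with_gate(64, 64, target_std=0.02, seed=1)
gate_d, stored_d = init_with_gate(64, 64, target_std=0.002, seed=2)

# The stored matrices have nearly equal norms ...
stored_ratio = frobenius_norm(stored_u) / frobenius_norm(stored_d)
# ... while the effective weights keep the intended 10x difference in scale.
effective_ratio = (frobenius_norm(effective_weight(gate_u, stored_u))
                   / frobenius_norm(effective_weight(gate_d, stored_d)))
```

In this sketch the non-uniform scale lives entirely in the two scalar gates, so the matrices that receive the bulk of the parameter updates start from comparable norms.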

Paper Structure

This paper contains 46 sections, 15 equations, 12 figures, and 12 tables.

Figures (12)

  • Figure 1: Loss of Transformer models with 13 billion (13B) parameters at the beginning of training (top). Update ratios for the up and down projection in the last feed-forward layer, $\|\Delta \bm{W}_u\|/\| \bm{W}_u\|$ and $\|\Delta \bm{W}_d\|/\| \bm{W}_d\|$, of the same (bottom). The horizontal lines are the update ratios before the largest spike. The baseline sets $\| \bm{W}_d\|$ smaller than the other parameters. The update ratio of $\bm{W}_d$ is larger at the very beginning and gets smaller after loss spikes occur. The baseline uses standard techniques for stable training, such as gradient clipping.
  • Figure 2: Loss of 13B models during training.
  • Figure 3: Norm of parameters $\| \bm{W}_d\|$ and $\| \bm{W}_u\|$ in the last layer at the beginning of the training. $\| \bm{W}_d\|$ and $\| \bm{W}_u\|$ of the proposed method overlap.
  • Figure 4: Loss of the 1.3B Transformer models at the beginning of the training (top). Update ratios $\|\Delta \bm{W}_d\|/\| \bm{W}_d\|$ and $\|\Delta \bm{W}_u\|/\| \bm{W}_u\|$ of the same (bottom).
  • Figure 5: Loss of the 130M Transformer models at the beginning of the training (top). Update ratios $\|\Delta \bm{W}_d\|/\| \bm{W}_d\|$ and $\|\Delta \bm{W}_u\|/\| \bm{W}_u\|$ of the same (bottom).
  • ...and 7 more figures
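
The update ratio $\|\Delta \bm{W}\|/\| \bm{W}\|$ plotted in the figures above can be illustrated with a small sketch. It assumes the Frobenius norm, which the captions do not specify, and uses toy matrices with made-up values; `update_ratio` is a hypothetical helper name.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(w * w for row in matrix for w in row))


def update_ratio(before, after):
    """Return ||after - before|| / ||before|| for two same-shaped matrices."""
    delta = [[a - b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(after, before)]
    return frobenius_norm(delta) / frobenius_norm(before)


# The same absolute update (0.5 on one entry) applied to two toy matrices.
large_before = [[3.0, 4.0]]      # norm 5.0
large_after = [[3.0, 4.5]]
small_before = [[0.375, 0.5]]    # norm 0.625
small_after = [[0.375, 1.0]]

ratio_large = update_ratio(large_before, large_after)
ratio_small = update_ratio(small_before, small_after)
```

An update of identical size yields a larger ratio for the matrix with the smaller norm, which is the sensitivity the abstract attributes to small-norm parameters.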