Exploring Quantization for Efficient Pre-Training of Transformer Language Models

Kamran Chitsaz, Quentin Fournier, Gonçalo Mordido, Sarath Chandar

TL;DR

This study explores the impact of quantization on efficient pre-training of Transformers, focusing on linear layer components, by systematically applying straightforward linear quantization to weights, activations, gradients, and optimizer states.

Abstract

The increasing scale of Transformer models has led to an increase in their pre-training computational requirements. While quantization has proven to be effective after pre-training and during fine-tuning, applying quantization in Transformers during pre-training has remained largely unexplored at scale for language modeling. This study aims to explore the impact of quantization for efficient pre-training of Transformers, with a focus on linear layer components. By systematically applying straightforward linear quantization to weights, activations, gradients, and optimizer states, we assess its effects on model efficiency, stability, and performance during training. By offering a comprehensive recipe of effective quantization strategies to be applied during the pre-training of Transformers, we promote high training efficiency from scratch while retaining language modeling ability. Code is available at https://github.com/chandar-lab/EfficientLLMs.
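
To make "straightforward linear quantization" concrete, here is a minimal sketch in plain Python of one common variant: symmetric quantization with an absolute-maximum scale, applied either per tensor or per channel (the two granularities the paper compares for weights). The function names, the choice of symmetric scaling, and the toy weight matrix are assumptions for illustration; this excerpt does not specify the exact scheme used in the paper or its repository.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_symmetric(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integer codes in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


def fake_quantize_per_tensor(weight: Matrix, bits: int = 8) -> Matrix:
    """Quantize and dequantize with a single scale for the whole matrix."""
    flat = [v for row in weight for v in row]
    codes, scale = quantize_symmetric(flat, bits)
    deq = dequantize(codes, scale)
    n_cols = len(weight[0])  # assumes a rectangular matrix
    return [deq[i * n_cols:(i + 1) * n_cols] for i in range(len(weight))]


def fake_quantize_per_channel(weight: Matrix, bits: int = 8) -> Matrix:
    """Quantize and dequantize with one scale per row (treated as an output channel)."""
    return [dequantize(*quantize_symmetric(row, bits)) for row in weight]


def row_error(original: List[float], approx: List[float]) -> float:
    """Largest absolute reconstruction error within one row."""
    return max(abs(x - y) for x, y in zip(original, approx))


# Hypothetical weights: one small-magnitude channel next to a large one.
weight = [[0.01, -0.02, 0.03], [1.0, -2.0, 3.0]]
per_tensor = fake_quantize_per_tensor(weight, bits=4)
per_channel = fake_quantize_per_channel(weight, bits=4)
print("row 0 error, per-tensor: ", row_error(weight[0], per_tensor[0]))
print("row 0 error, per-channel:", row_error(weight[0], per_channel[0]))
```

In this toy case the shared per-tensor scale is set by the large channel, so the small channel collapses to zero, whereas a per-channel scale preserves it. This illustrates the granularity trade-off only and makes no claim about the paper's measured results.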

Paper Structure (19 sections, 1 equation, 15 figures, 11 tables)

Figures (15)

  • Figure 1: Overview of the quantization process in forward and backward passes.
  • Figure 2: Distribution of peak memory usage across different model sizes (GPT-2 Small, Medium, and Large) for a constant context length of 1024, with varying batch sizes.
  • Figure 3: Proportion of total execution time consumed by linear layers in the attention block of GPT-2 models (Small, Medium, Large, and X-Large) across different sequence lengths.
  • Figure 4: Comparison of different weight quantization schemes. (Bottom) Validation loss across training iterations for 4-bit and 8-bit quantization, both per-tensor and per-channel, alongside the baseline. (Top) Few-shot accuracy on downstream tasks for the corresponding quantization approaches, demonstrating the efficacy of 8-bit per-channel weight quantization.
  • Figure 5: Sharpness comparison between the baseline model and 4-bit weight quantization. (Top) $m$-sharpness. (Bottom) Loss surfaces.
  • ...and 10 more figures