Optimizing Large Language Model Training Using FP4 Quantization

Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng

TL;DR

The paper tackles the high computational cost of pretraining large language models by proposing FP4 training, an ultra-low precision framework designed to maintain accuracy. It introduces a differentiable gradient estimator to enable meaningful weight updates and an outlier clamping/compensation scheme to stabilize activations, achieving performance close to BF16 and FP8 baselines on models up to 13B parameters trained for up to 100B tokens. Through experiments, it demonstrates near-parity in training loss and downstream task performance, while detailing a theoretical speedup analysis and practical overheads. The work further argues for future hardware support for FP4 to unlock substantial energy savings and throughput in large-scale AI training.

Abstract

The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
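The abstract names outlier clamping and compensation without giving details here, so the following is only a minimal sketch of one way such a split could work. It assumes the clamp thresholds come from a quantile of the tensor and that "compensation" means keeping the clipped-off part as a separate, mostly-zero residual. The function name `clamp_and_compensate` and the parameter `alpha` are hypothetical and not taken from the paper.

```python
import numpy as np

def clamp_and_compensate(y, alpha=0.99):
    """Split y into a clamped tensor and a residual such that y == y_c + delta.

    Assumption: thresholds are the (1 - alpha) and alpha quantiles of y.
    """
    lo, hi = np.quantile(y, [1.0 - alpha, alpha])
    y_c = np.clip(y, lo, hi)   # outliers pulled back into [lo, hi]
    delta = y - y_c            # non-zero only where y fell outside [lo, hi]
    return y_c, delta

# Usage: an activation-like vector with one large outlier.
y = np.random.default_rng(0).standard_normal(1000)
y[0] = 50.0
y_c, delta = clamp_and_compensate(y)
sparsity = 1.0 - np.count_nonzero(delta) / delta.size
```

Under these assumptions, only `y_c` would be quantized to FP4, while the sparse `delta` would be handled separately at higher precision, so the large values no longer stretch the FP4 dynamic range.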

Paper Structure

This paper contains 19 sections, 28 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Directly casting to FP4 results in significantly higher training loss, whereas our proposed FP4 method achieves accuracy comparable to the BF16 baseline. These results are based on experiments with a 400M LLaMA2 model.
  • Figure 2: The structure of the proposed FP4 training scheme during the forward pass of a linear layer. A high-precision tensor, such as BF16, is quantized into the FP4 format using look-up table quantization. During the GeMM computation, both weight and activation tensors are quantized into FP4 to leverage the FP4 tensor cores. Two scaling factors are then applied to the final result to ensure computational correctness.
  • Figure 3: Visualization of the Differentiable Gradient Estimator (DGE). (a) Comparison of three quantization methods: hard quantization, differentiable quantization, and STE quantization, demonstrated on a single quantization step. (b) The full quantization curve for E2M1 quantization within its dynamic range $[-6.0, 6.0]$. (c) The derivative curves for the three methods, highlighting that hard quantization has a gradient of $f'(x) \equiv 0$, while STE assumes a constant gradient of $f'(x) \equiv 1$.
  • Figure 4: Visualization of the outlier clamping method, based on the first transformer layer’s output of the LLaMA 1.3B model after 30,000 training iterations. Top: Quantization performed without outlier clamping, leading to severe loss of information. Bottom: Quantization after applying outlier clamping, which preserves the tensor structure.
  • Figure 5: Training curves for BF16 and FP4 models at different model sizes. (a) Training curves for the 1.3B LLaMA model. (b) Training curves for the 7B LLaMA model. (c) Training curves for the 13B LLaMA model.
  • ...and 9 more figures
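
The forward pass described in the Figure 2 caption (look-up table quantization of both operands, an FP4 GeMM, then two scaling factors applied to the result) can be sketched roughly as follows. This is a simulation in ordinary floating point, not the authors' kernel. It assumes absmax scaling, one scale per row of the activation matrix and one per column of the weight matrix, and nearest-value rounding onto the standard E2M1 grid, whose range $[-6.0, 6.0]$ matches Figure 3. The names `quantize_e2m1` and `fp4_linear` are hypothetical.

```python
import numpy as np

# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# Full signed look-up table: 15 distinct values, sorted ascending.
E2M1_VALUES = np.concatenate([-E2M1_GRID[:0:-1], E2M1_GRID])

def quantize_e2m1(x, axis):
    """Simulated FP4 quantization: absmax-scale along `axis`, then snap
    each element to the nearest look-up table value."""
    absmax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = 6.0 / np.maximum(absmax, 1e-12)  # largest magnitude maps to 6.0
    scaled = x * scale
    idx = np.argmin(np.abs(scaled[..., None] - E2M1_VALUES), axis=-1)
    return E2M1_VALUES[idx], scale

def fp4_linear(a, w):
    """Simulated forward pass of a linear layer, a: (m, k), w: (k, n)."""
    a_q, s_a = quantize_e2m1(a, axis=1)  # assumed: one scale per row of a
    w_q, s_w = quantize_e2m1(w, axis=0)  # assumed: one scale per column of w
    # The GeMM runs on quantized values; both scales are undone afterwards.
    return (a_q @ w_q) / (s_a * s_w)

# Usage: compare against the full-precision product.
rng = np.random.default_rng(0)
a, w = rng.standard_normal((4, 8)), rng.standard_normal((8, 3))
err = np.abs(fp4_linear(a, w) - a @ w).max()
```

Because the scales vary only along the non-contracted dimensions, they factor out of the matrix product, which is why they can be applied once to the final result instead of inside the GeMM. The backward pass and the DGE shown in Figure 3 are not covered by this sketch.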