Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models

Jiecheng Zhou, Ding Tang, Rong Fu, Boni Hu, Haoran Xu, Yi Wang, Zhilin Pei, Zhongling Su, Liang Liu, Xingcheng Zhang, Weiming Zhang

TL;DR

This work addresses the high computational cost of pretraining large language models by proposing an FP4 mixed-precision scheme with per-block quantization and module-aware strategies. It protects sensitive components (e.g., QKV attention) by keeping them in FP8 and applies gradient-sensitive FP8 for FFN updates, coupled with a two-stage Target Precision Training Schedule to mitigate quantization noise. Across GPT-2 and LLaMA models trained on up to tens of billions of tokens, FP4 achieves accuracy and downstream task performance comparable to BF16/FP8 while reducing theoretical computation costs by about 30%. The results, supported by ablations on module sensitivity and scheduling, suggest FP4-enabled pretraining as a viable path toward efficient ultra-low-precision training on future hardware.

Abstract

The burgeoning computational demands of training large language models (LLMs) necessitate efficient methods, including quantized training, which leverages low-bit arithmetic operations to reduce costs. While FP8 precision has shown potential, leveraging FP4 remains challenging due to inherent quantization errors and limited representation capability. Based on the Transformer architecture, we present an FP4 training scheme for LLMs, overcoming these obstacles through mixed-precision quantization strategies tailored for different modules and training stages. This allows us to apply the precision level suitable to distinct components within the model, ensuring that multi-head attention and linear layers are handled appropriately. Our pretraining recipe ensures stability in backpropagation by incorporating fine-grained quantization methods with a target precision training schedule. Experimental results demonstrate that our FP4 training scheme achieves accuracy comparable to BF16 and FP8 at a lower theoretical computational cost. With the advent of next-generation hardware supporting FP4, our method lays the foundation for efficient ultra-low-precision training.
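
To make the idea of fine-grained (per-block) quantization concrete, the following is a minimal NumPy sketch of simulated FP4 quantization. It is an illustration under stated assumptions, not the authors' implementation: the E2M1 value grid, the absmax scaling rule, the default block size of 128, and the function name `fake_quantize_fp4_per_block` are all choices made here for the example and are not taken from the paper.

```python
import numpy as np

# Assumption: FP4 uses the E2M1 layout, whose non-negative magnitudes are listed below.
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def fake_quantize_fp4_per_block(x, block_size=128):
    """Quantize to the FP4 grid block by block, then dequantize (simulation only).

    The function name and the default block size are hypothetical choices
    made for this sketch.
    """
    x = np.asarray(x, dtype=np.float32)
    flat = x.reshape(-1)

    # Pad with zeros so the flattened tensor splits evenly into blocks.
    pad = (-flat.size) % block_size
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)

    # One scale per block: map the block's largest magnitude to the largest
    # grid value (6.0). All-zero blocks get a scale of 1 to avoid dividing by 0.
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / FP4_E2M1_GRID[-1], 1.0).astype(np.float32)

    scaled = blocks / scale

    # Round each magnitude to the nearest grid point. On exact ties argmin
    # picks the smaller magnitude, which is simpler than hardware rounding modes.
    distance = np.abs(np.abs(scaled)[..., None] - FP4_E2M1_GRID)
    quantized = np.sign(scaled) * FP4_E2M1_GRID[distance.argmin(axis=-1)]

    # Dequantize, drop the padding, and restore the original shape.
    dequantized = (quantized * scale).reshape(-1)[: flat.size]
    return dequantized.reshape(x.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    activations = rng.standard_normal((4, 256)).astype(np.float32)
    activations[0, 0] = 50.0  # a single outlier

    for size in (1024, 128, 32):
        approx = fake_quantize_fp4_per_block(activations, block_size=size)
        error = np.abs(activations - approx).mean()
        print(f"block_size={size:5d}  mean abs error={error:.4f}")
```

Because each block carries its own scale, an outlier only coarsens the grid for the values that share its block, which is the usual motivation for choosing smaller blocks when only 16 distinct values are available.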

Paper Structure

This paper contains 13 sections, 8 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: (a) shows the proportion of computational overhead for the main computation components of a transformer block when using the LLaMA 7B configuration with a sequence length of 4K. (b) shows the distribution of activations and gradients after the GPT-large model has been trained to approximately 10B tokens. (c) shows the heatmap of attention scores when using different training strategies. (d) and (e) illustrate our training scheme, which is detailed in the method section.
  • Figure 2: Loss curve for the Target Precision Training Schedule.