InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Hongxia Yang

TL;DR

This work targets the high computational cost of training large language models by proposing an end-to-end FP8 training recipe that combines continual pretraining and supervised fine-tuning (SFT). The core method is a hybrid-granularity quantization strategy that uses per-block quantization for weights and per-token quantization for activations, while keeping critical components in FP32 to preserve precision. Empirically, FP8 training is stable and nearly lossless relative to BF16 across 160B-token continual pretraining and subsequent SFT, with InfiR2-1.5B-FP8 and InfiR2-7B-FP8 achieving competitive or superior performance on reasoning benchmarks (e.g., AIME24, GPQA). It also yields substantial efficiency gains: up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. The results establish FP8 as a practical alternative to BF16 for scalable LLM training, and the authors publish their code and intermediate artifacts to democratize access to FP8 training.

Abstract

The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.

Paper Structure

This paper contains 20 sections, 2 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: An illustration of three common quantization granularities: per-tensor, per-block, and per-token. The tensor has a shape of [s, d], where s is the context length and d is the dimension. bs represents the block size.
  • Figure 2: An illustration of a hybrid granularity quantization strategy using FP8, compared to a standard BF16 pipeline. In the FP8 pipeline, different quantization methods are applied: per-block quantization for weights (purple), and per-token quantization for activations (blue). The diagram shows the complete training process, including forward propagation (FProp), weight gradient calculation (Wgrad), and input gradient calculation (Dgrad), along with a detailed view of the FProp workflow.
  • Figure 3: The FP8 training loss of InfiR2-1.5B and InfiR2-7B.
  • Figure 4: The validation loss and training loss for continual pretraining of Qwen2.5-1.5B-base, comparing FP8 and BF16.
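
The three granularities named in Figure 1 differ only in how many scaling factors cover a tensor of shape [s, d]. The sketch below is a minimal illustration, not the authors' implementation: it assumes the common amax-based scaling rule (scale = max |x| / largest representable FP8 value) and the E4M3 format, whose largest finite magnitude is 448. The function names and the toy tensor are hypothetical, and per-token is taken here to mean one scale per row.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in E4M3 (format choice is an assumption)


def amax(values):
    """Largest absolute value, floored so the scale is never zero."""
    return max(max(abs(v) for v in values), 1e-12)


def per_tensor_scale(x):
    """One scale shared by the whole [s, d] tensor."""
    return amax([v for row in x for v in row]) / FP8_E4M3_MAX


def per_token_scales(x):
    """One scale per row (token): s scales in total."""
    return [amax(row) / FP8_E4M3_MAX for row in x]


def per_block_scales(x, bs):
    """One scale per bs x bs tile: ceil(s/bs) x ceil(d/bs) scales."""
    s, d = len(x), len(x[0])
    scales = []
    for r0 in range(0, s, bs):
        row_of_scales = []
        for c0 in range(0, d, bs):
            tile = [v for row in x[r0:r0 + bs] for v in row[c0:c0 + bs]]
            row_of_scales.append(amax(tile) / FP8_E4M3_MAX)
        scales.append(row_of_scales)
    return scales


# Toy [s=4, d=4] tensor with a single outlier in the first row.
x = [
    [448.0, 1.0, 2.0, 1.0],
    [1.0, 2.0, 1.0, 2.0],
    [2.0, 1.0, 4.48, 1.0],
    [1.0, 2.0, 1.0, 2.0],
]

tensor_scale = per_tensor_scale(x)        # a single float
token_scales = per_token_scales(x)        # list of 4 floats
block_scales = per_block_scales(x, bs=2)  # 2 x 2 nested list
```

With per-tensor scaling, the single outlier sets the scale for every element. With per-token or per-block scaling, it only inflates the scale of its own row or tile, so the remaining values keep a scale matched to their own range.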