MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling

Yu Zhang, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu

TL;DR

MOSS tackles FP8 training instability and overhead with two techniques: a two-level microscaling scheme for activations that minimizes dequantization in the GEMM main loop, and an automatic scaling method that predicts weight-scale evolution from the bounded updates of AdamW. Together, they let FP8 pretraining and fine-tuning match BF16 accuracy while improving throughput and reducing memory and communication costs. Experiments on 7B-scale models show accuracy comparable to BF16 and up to 34% higher training throughput, supported by ablations and activation-fidelity analyses. The work offers a practical, hardware-friendly path to scalable FP8 training.

Abstract

Training large language models with FP8 formats offers significant efficiency gains, but the reduced numerical precision of FP8 makes stable and accurate training difficult. Current frameworks preserve training performance using mixed-granularity quantization, i.e., per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of the matrix multiplication, which introduces additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to adjust scaling factors dynamically based on the current data distribution. This online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B-parameter model, matching the performance of the BF16 baseline while delivering up to 34% higher training throughput.
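To make the two-level idea concrete, here is a minimal NumPy sketch; it is not the paper's kernel. It assumes a single per-tensor FP32 global scale, groups of 32 elements for the power-of-two local scales, and the E4M3 range of ±448. The names `round_to_e4m3`, `quantize_two_level`, and `dequantize_two_level` are hypothetical, and FP8 rounding is only simulated in ordinary floating point.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3
GROUP = 32            # assumed local group size (power-of-two scale per group)

def round_to_e4m3(x):
    """Simulate E4M3 rounding: clamp, then snap to the local grid spacing."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    _, exp = np.frexp(x)  # |x| lies in [2**(exp-1), 2**exp)
    # 3 mantissa bits give spacing 2**(e-3); below 2**-6 the subnormal spacing 2**-9 applies
    step = np.exp2(np.maximum(exp - 1, -6) - 3)
    return np.round(x / step) * step

def quantize_two_level(x):
    """Return (simulated FP8 values, int local exponents, FP32 global scale)."""
    groups = np.asarray(x, dtype=np.float32).reshape(-1, GROUP)
    amax = float(np.abs(groups).max())
    # Level 1: one high-precision scale mapping the tensor maximum to 448
    global_scale = np.float32(max(amax, 1e-12) / FP8_E4M3_MAX)
    # Level 2: per-group power-of-two scale, stored as an integer exponent <= 0
    group_amax = np.abs(groups).max(axis=1, keepdims=True)
    ratio = np.maximum(group_amax, 1e-30) / (global_scale * FP8_E4M3_MAX)
    local_exp = np.clip(np.ceil(np.log2(ratio)), -127, 0).astype(np.int32)
    q = round_to_e4m3(groups / (global_scale * np.exp2(local_exp)))
    return q, local_exp, global_scale

def dequantize_two_level(q, local_exp, global_scale):
    """Undo both levels: exponent shift per group, then one FP32 multiply."""
    return q * np.exp2(local_exp) * global_scale

# Usage: one outlier no longer dictates the resolution of every other group.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64)).astype(np.float32)
x[0, 0] = 100.0
q, local_exp, global_scale = quantize_two_level(x)
x_hat = dequantize_two_level(q, local_exp, global_scale).reshape(x.shape)
```

In this sketch the local scales are exact powers of two, so applying them only shifts exponents; the single floating-point multiply by the global scale can be applied once, after accumulation.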

Paper Structure

This paper contains 27 sections, 11 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Quantized GEMM Runtime Comparison on H800.
  • Figure 2: Depiction of the MXFP8 data format with a level-1 FP32 scale factor and a level-2 E8M0 scale factor, shared over groups of size $k_1 \sim 10\text{K}$ and $k_2 = 32$, respectively.
  • Figure 3: FP8 Quantized GEMM on GPUs. (a) Per-group FP8 GEMM in COAT suffers from significant dequantization overhead in the main loop. (b) In contrast, MOSS achieves faster matrix multiplication by confining the main loop to Tensor Core operations, with all dequantization deferred to the epilogue. Leveraging the fast MXFP8 GEMM module and a two-level microscaling quantization strategy, MOSS significantly reduces dequantization overhead and improves kernel efficiency.
  • Figure 4: Automatic scaling trend under interval = 500.
  • Figure 5: OLMo-7B pretraining curve.
  • ...and 3 more figures
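
The automatic scaling idea from the abstract can be illustrated with a small NumPy sketch. It assumes, as a simplification, that no weight element moves by more than the learning rate in a single AdamW step, so the running maximum can be bounded without a max-reduction and then recalibrated periodically. The class `AutoWeightScale` and its `interval` parameter are hypothetical; `interval` is assumed to play the role of the interval named in the Figure 4 caption.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

class AutoWeightScale:
    """Hypothetical helper: tracks an upper bound on max|W| between recalibrations."""

    def __init__(self, weight, interval=500):
        self.interval = interval
        self.step_count = 0
        self.amax_bound = float(np.abs(weight).max())  # one initial max-reduction

    def update(self, weight, lr):
        """Call once per optimizer step; returns the per-tensor scale to use."""
        self.step_count += 1
        if self.step_count % self.interval == 0:
            # Periodic recalibration with a real max-reduction
            self.amax_bound = float(np.abs(weight).max())
        else:
            # Assumed bound: each element changes by at most lr per step
            self.amax_bound += lr
        return self.amax_bound / FP8_E4M3_MAX

# Usage with synthetic updates that respect the assumed bound.
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16))
lr = 1e-3
tracker = AutoWeightScale(w, interval=10)
history = []
for t in range(1, 26):
    w = w + lr * rng.uniform(-1.0, 1.0, size=w.shape)
    scale = tracker.update(w, lr)
    history.append((t, tracker.amax_bound, float(np.abs(w).max())))
```

The predicted bound drifts upward between recalibrations, so the scale is slightly conservative; a shorter interval tightens it at the cost of more max-reductions.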