FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

Fengjuan Wang; Zhiyi Su; Xingzhu Hu; Cheng Wang; Mou Sun

TL;DR

This work tackles the prohibitive cost of training large Mixture-of-Experts (MoE) models by proposing FP8-Flow-MoE, a quantization-consistent FP8 dataflow that avoids double quantization through a scaling-aware transpose. It introduces a casting-free FP8 recipe and a high-performance kernel suite, including fused Permute+Padding and fused SwiGLU+Quantization, enabling end-to-end FP8 computation with only two casts at defined boundaries. Empirical results on a 671B MoE show up to 21% higher throughput and a 16.5 GB reduction in per-GPU memory while preserving convergence parity with BF16; the approach is also validated on a 16B model. It is designed to be plug-and-play with TransformerEngine and Megatron-LM, offers a practical path toward fully low-precision MoE training, and is planned for open-source release.

Abstract

Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8's theoretical efficiency. However, naively removing them by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability. We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and reduce the number of explicit cast operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate up to 21% higher throughput and 16.5 GB lower memory usage per GPU compared to BF16 and naïve FP8 baselines, while maintaining stable convergence. We provide a plug-and-play FP8 recipe compatible with TransformerEngine and Megatron-LM, which will be open-sourced soon.
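
The double quantization error described in the abstract can be illustrated with a small NumPy emulation. This is only a sketch under stated assumptions, not the paper's kernels: `fp8_round` is a hypothetical helper that emulates E4M3 rounding in float64, scales are taken per whole row or per whole column (a real recipe may use finer tiles), and the matrix is synthetic. It compares quantizing a tensor column-wise directly against quantizing it row-wise first and then re-quantizing the dequantized result column-wise, as a naive FP8-only dataflow would when a transposed layout is needed.

```python
import numpy as np

FP8_MAX = 448.0  # largest finite E4M3 value


def fp8_round(x):
    """Hypothetical helper: round to the nearest E4M3 value, emulated in float64."""
    x = np.clip(np.asarray(x, dtype=np.float64), -FP8_MAX, FP8_MAX)
    _, e = np.frexp(x)             # x = m * 2**e with 0.5 <= |m| < 1
    e = np.maximum(e, -5)          # below 2**-6 the grid is subnormal (fixed step)
    step = np.ldexp(1.0, e - 4)    # 4 significant bits -> spacing 2**(e - 4)
    return np.clip(np.round(x / step) * step, -FP8_MAX, FP8_MAX)


def quantize(x, axis):
    """Scale so the max magnitude along `axis` maps to FP8_MAX, then round."""
    amax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(amax > 0, amax / FP8_MAX, 1.0)
    return fp8_round(x / scale), scale


def rel_err(approx, ref):
    return np.linalg.norm(approx - ref) / np.linalg.norm(ref)


# Synthetic activations (an assumption): rows differ in magnitude.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64)) * np.exp(rng.standard_normal((64, 1)))

# Single quantization: column-wise scales computed from the original tensor.
q_col, s_col = quantize(x, axis=0)
err_single = rel_err(q_col * s_col, x)

# Double quantization: row-wise first, then column-wise on the dequantized copy.
q_row, s_row = quantize(x, axis=1)
q_col2, s_col2 = quantize(q_row * s_row, axis=0)
err_double = rel_err(q_col2 * s_col2, x)

print(f"single quantization relative error: {err_single:.4f}")
print(f"double quantization relative error: {err_double:.4f}")
```

In this toy setting the second rounding step uses scaling factors that are inconsistent with the first, so its error compounds rather than cancels. The scaling-aware transpose proposed in the paper is meant to avoid this re-quantization; its mechanism is not reproduced here.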
