A Metric Driven Approach to Mixed Precision Training

Mitchelle Rasquinha, Gil Tabak

TL;DR

This work addresses the high memory and compute cost of training large neural networks by introducing a metric-driven framework for choosing mixed-precision formats, focusing on FP8 and INT8 training with dynamic scaling. It formalizes quantization as uniform integer quantization with per-operand scaling, derives the key equations, and analyzes the trade-offs among the INT8, E4M3, and E5M2 formats. Experiments on BERT show that weights can be quantized losslessly with INT8 and that FP8 formats can converge with appropriate rounding, whereas activations and gradients require finer granularity and specialized rounding to preserve quality. The framework offers a principled way to navigate the search space of numeric formats, and the results suggest that metric-guided mixed precision can enable scalable training on accelerators and may generalize beyond the tested model.
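
To make "uniform integer quantization with per-operand scaling" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names (`quantize_int8`, `dequantize`, `quantized_matmul`), the symmetric range [-127, 127], the max-abs scale, and the per-row and per-column granularity are all choices made for this example.

```python
import numpy as np

def quantize_int8(x, axis=None):
    """Symmetric INT8 quantization with a scale computed from the data.

    Assumption for this sketch: scale = max|x| / 127 along `axis`
    (axis=None gives a single scale for the whole tensor).
    """
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    # Avoid a zero scale when a slice is all zeros.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    # np.rint rounds half to even (round-to-nearest-even).
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to real values."""
    return q.astype(np.float32) * scale

def quantized_matmul(a, b):
    """Matmul with each operand quantized separately (hypothetical helper).

    The left operand gets one scale per row and the right operand one scale
    per column, so both scales factor out of the integer accumulation.
    """
    qa, sa = quantize_int8(a, axis=1)   # sa has shape (m, 1)
    qb, sb = quantize_int8(b, axis=0)   # sb has shape (1, n)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)  # integer accumulate
    return acc.astype(np.float32) * sa * sb

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 3)).astype(np.float32)
print("max abs error:", np.max(np.abs(quantized_matmul(a, b) - a @ b)))
```

Because the scales are recomputed from each tensor at call time, this corresponds to dynamic scaling; a static scheme would instead reuse scales collected earlier.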

Abstract

As deep learning methodologies have developed, it has been generally agreed that increasing neural network size improves model quality. However, this is at the expense of memory and compute requirements, which also need to be increased. Various efficiency techniques have been proposed to rein in hardware costs, one being the use of low precision numerics. Recent accelerators have introduced several different 8-bit data types to help accommodate DNNs in terms of numerics. In this paper, we identify a metric driven methodology to aid in the choice of numerics. We demonstrate how such a methodology can help scale training of a language representation model. The technique can be generalized to other model architectures.

Paper Structure

This paper contains 8 sections, 3 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Illustration of the neural network graph modification during quantization.
  • Figure 2: Distributions of the input tensors to the query projection dot operation in the forward and backward passes. Red text denotes the tensor type.
  • Figure 3: A comparison of the relative error profile of INT8 and two FP8 formats, assuming RTNE.
  • Figure 4: Student's T-Distribution of the quantization error when INT8, E4M3 and E5M2 were each used as input data types.
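
The kind of comparison described in the Figure 3 caption can be approximated with a short simulation. The sketch below is not the paper's code and does not reproduce its figure. It assumes the commonly used FP8 parameters (E4M3: exponent bias 7, largest finite value 448; E5M2: exponent bias 15, largest finite value 57344), saturating overflow, and an INT8 scale of 1/127. The helper names are hypothetical.

```python
import numpy as np

def round_to_fp8(x, exp_bits, man_bits, max_value):
    """Round values to an FP8-like grid with round-to-nearest-even (RTNE).

    Simulated in float64; values beyond max_value saturate (an assumption).
    """
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    min_exp = 1 - bias                      # exponent of the smallest normal
    mag = np.abs(x)
    safe = np.where(mag > 0, mag, 1.0)      # avoid log2(0)
    # Subnormals share the smallest normal exponent, hence the clamp.
    exp = np.maximum(np.floor(np.log2(safe)), min_exp)
    step = 2.0 ** (exp - man_bits)          # grid spacing in this binade
    q = np.minimum(np.rint(mag / step) * step, max_value)
    return np.sign(x) * q

def round_to_int8(x, scale):
    """Round to a uniform INT8 grid with the given scale, using RTNE."""
    x = np.asarray(x, dtype=np.float64)
    return np.clip(np.rint(x / scale), -127, 127) * scale

def relative_error(x, xq):
    return np.abs(xq - x) / np.abs(x)

# Sample magnitudes inside the normal range of both FP8 formats.
x = np.geomspace(2.0 ** -5, 1.0, 257)
err_e4m3 = relative_error(x, round_to_fp8(x, 4, 3, 448.0))
err_e5m2 = relative_error(x, round_to_fp8(x, 5, 2, 57344.0))
err_int8 = relative_error(x, round_to_int8(x, scale=1.0 / 127.0))
print("max relative error  E4M3:", err_e4m3.max(),
      " E5M2:", err_e5m2.max(), " INT8:", err_int8.max())
```

For normal values, a floating-point format with m mantissa bits has relative rounding error below 2^-(m+1) regardless of magnitude. A uniform integer grid has a fixed absolute step, so its relative error grows as values shrink toward zero.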