On Approximate 8-bit Floating-Point Operations Using Integer Operations

Theodor Lindberg; Oscar Gustafsson

TL;DR

This paper investigates approximate eight-bit floating-point (FP8) operations computed with integer arithmetic for the E5M2 and E4M3 formats, enabling efficient computation on edge devices. It uses a Logarithmic Number System (LNS) view with Mitchell's approximation to treat each FP8 number as a sign-magnitude fixed-point value in the logarithmic domain, introducing a format-specific offset $B$ to simplify arithmetic. Through detailed error analysis, the authors derive carry-in expressions that give faithfully or correctly rounded results for multiplication, square, division, reciprocal, square-root, and reciprocal square-root in both formats, with specific constants and cases outlined. Hardware implementations on ASIC and FPGA show substantial area and speed savings, particularly for the E4M3 format on FPGA, supporting the practical potential of the approach for energy-efficient FP8 processing. The work points to a viable path for deploying FP8 computation in resource-constrained environments, such as ML and IoT workloads where low latency and low energy are critical.
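The core idea can be illustrated with a minimal Python sketch of uncompensated Mitchell-style multiplication on raw E5M2 bit patterns. The names `decode` and `approx_mul` are hypothetical, and the offset `B` is assumed here to be the exponent bias shifted into the exponent field; the paper's exact definition and its carry-in terms are not reproduced in this summary. The sketch handles only normal, finite operands.

```python
# Sketch only: approximate E5M2 multiplication by integer addition of bit patterns.
# Assumptions: normal, finite operands; zero, subnormals, Inf, NaN, overflow and
# underflow are NOT handled, and no rounding carry-in is applied.

EXP_BITS, MAN_BITS = 5, 2
BIAS = (1 << (EXP_BITS - 1)) - 1               # 15 for E5M2
B = BIAS << MAN_BITS                           # assumed offset: bias aligned to the exponent field (60)
MAG_MASK = (1 << (EXP_BITS + MAN_BITS)) - 1    # low 7 bits hold exponent and mantissa


def decode(bits: int) -> float:
    """Value of a normal E5M2 bit pattern (sign | 5-bit exponent | 2-bit mantissa)."""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> MAN_BITS) & ((1 << EXP_BITS) - 1)
    man = bits & ((1 << MAN_BITS) - 1)
    return sign * (1 + man / (1 << MAN_BITS)) * 2.0 ** (exp - BIAS)


def approx_mul(a: int, b: int) -> int:
    """Add the magnitudes as integers, remove one copy of the offset, XOR the signs."""
    sign = (a ^ b) & 0x80
    mag = (a & MAG_MASK) + (b & MAG_MASK) - B
    return sign | (mag & MAG_MASK)
```

For example, `0x3E` encodes 1.5 and `0x40` encodes 2.0; `approx_mul(0x3E, 0x40)` yields the pattern for 3.0, which is exact because one mantissa is zero. For 1.5 × 1.5 the sketch returns 2.0, while the exact product is 2.25.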

Abstract

In this work, approximate eight-bit floating-point operations performed using simple integer operations are discussed. For two-bit mantissa formats, faithful rounding can always be obtained for the considered operations. For all operations, correctly rounded results can be obtained for different rounding modes, either directly or by adding a conditional carry-in. For three-bit mantissa formats, faithful rounding can sometimes be obtained directly, while for other operations a conditional carry-in must be added. Correctly rounded results can be obtained for most operations and rounding modes using slightly more complicated expressions for the carry-in. Hardware implementation results for multiplication using both standard-cell and FPGA technology are presented, illustrating the potential benefit of integer computation. Especially for FPGA, significant resource savings are obtained.
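The difference between two-bit and three-bit mantissas can be checked with a short exhaustive sketch for multiplication. It assumes the uncompensated scheme in which mantissa fields are added as integers and a carry out of the mantissa increments the exponent; overflow, underflow, and special values are ignored, so the error depends only on the mantissa pair. The function name `max_mul_error_ulp` is hypothetical, and the error is measured against the mathematically exact product in ulp of the approximate result.

```python
from fractions import Fraction


def max_mul_error_ulp(man_bits: int) -> Fraction:
    """Largest |exact - approx| in ulp over all mantissa pairs (exponents cancel)."""
    n = 1 << man_bits
    worst = Fraction(0)
    for ma in range(n):
        for mb in range(n):
            # Exact product of the two significands, a value in [1, 4).
            exact = Fraction(n + ma, n) * Fraction(n + mb, n)
            s = ma + mb
            if s < n:
                # No carry: the result stays in the binade [1, 2).
                approx, ulp = Fraction(n + s, n), Fraction(1, n)
            else:
                # Carry into the exponent: the result is in [2, 4), ulp doubles.
                approx, ulp = 2 * Fraction(s, n), Fraction(2, n)
            worst = max(worst, abs(exact - approx) / ulp)
    return worst


if __name__ == "__main__":
    for bits in (2, 3):
        print(bits, max_mul_error_ulp(bits))
```

Under these assumptions the worst case stays below one ulp for a two-bit mantissa and exceeds one ulp for a three-bit mantissa, which is consistent with the abstract's statement that three-bit formats need a conditional carry-in for some operations.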

Paper Structure (22 sections, 52 equations, 9 figures, 4 tables)

Introduction
Approximate Floating-Point Operations in the Logarithmic Domain
Error Analysis and Compensations
E5M2
Multiplication
Square
Division
Reciprocal
Square-Root
Reciprocal Square-Root
E4M3
Multiplication
Square
Division
Reciprocal
...and 7 more sections

Figures (9)

Figure 1: Binary representation of the considered FP8 formats.
Figure 2: Error in ulp for approximate E5M2 multiplication using (\ref{eq:e5m2-mul}), compared to the mathematically exact result.
Figure 3: Error in ulp for approximate E5M2 multiplication using (\ref{eq:e5m2-mul}), compared to RN$_e$.
Figure 4: Error in ulp for approximate E5M2 division using (\ref{eq:e5m2-div}), compared to the mathematically exact result.
Figure 5: Error in ulp for approximate E5M2 division using (\ref{eq:e5m2-div-dec}), compared to the mathematically exact result.
...and 4 more figures
