A Pilot Study on Tunable Precision Emulation via Automatic BLAS Offloading

Hang Liu, Junjie Li, Yinzhi Wang

TL;DR

This work investigates automatic BLAS offloading combined with INT8-based emulation to accelerate FP64 HPC workloads on modern GPUs with a cache-coherent Unified Memory Architecture. By applying the Ozaki scheme on integer matrix multiplication units (IMMUs) and using the offload tools ozIMMU and SCILIB-Accel, the study demonstrates tunable precision emulation that preserves the original algorithms while improving hardware utilization. Experiments on the MuST suite (MT u56) show that increasing the number of INT8 splits improves accuracy: splits 5–6 achieve errors near FP64 for key metrics and splits 7–8 reach FP64-equivalent accuracy, albeit with performance penalties compared to native FP64. The results suggest a path toward adaptive precision strategies that balance accuracy and performance in HPC, and they argue for closer collaboration between hardware developers and scientists to design data types and offloading workflows suited to future AI-accelerated scientific computing.

Abstract

This study explores the use of automatic BLAS offloading and INT8-based emulation for accelerating traditional HPC workloads on modern GPU architectures. Using low-bitwidth integer units and a cache-coherent Unified Memory Architecture, we emulate double-precision matrix multiplications in the MuST application without code changes. We find that accuracy depends on both the arithmetic precision and the properties of the operator, which can be addressed through tunable precision emulation. Unlike traditional mixed-precision approaches, this method preserves the original algorithms while optimizing hardware utilization. We showcase the potential of improving accuracy and performance at the same time. This work highlights the potential of AI-driven hardware to transform HPC and advocates adaptive precision strategies in future scientific computing.
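To make the idea of a tunable split number concrete, the following is a minimal sketch of slice-based integer emulation of a double-precision matrix product, in the spirit of the Ozaki scheme. It is not the ozIMMU implementation: the function names (`split_rows`, `emulated_matmul`, `max_abs_error`), the 6-bit slice width, and the rule for which slice pairs are kept are all assumptions chosen for illustration, and plain Python integers stand in for the INT8 inputs and INT32 accumulators of a GPU integer unit. The `num_splits` argument plays the role of the split number, but its values are not guaranteed to correspond one-to-one with the split numbers reported in the study.

```python
import math
import random


def split_rows(M, num_splits, bits=6):
    """Split each row of M into integer slices that fit in INT8.

    Row r of M is approximated by
        2**exps[r] * sum_i slices[i][r][k] * 2**(-bits * (i + 1)).
    """
    # One power-of-two scale per row, so every scaled entry has magnitude < 1.
    exps = [math.frexp(max(abs(x) for x in row))[1] for row in M]
    resid = [[math.ldexp(x, -e) for x in row] for row, e in zip(M, exps)]
    slices = []
    for _ in range(num_splits):
        current = []
        for row in resid:
            ints = []
            for k, x in enumerate(row):
                shifted = math.ldexp(x, bits)  # exact scaling by 2**bits
                q = round(shifted)             # next `bits` bits as an integer
                assert -128 <= q <= 127        # slice value fits in INT8
                ints.append(q)
                row[k] = shifted - q           # remainder, magnitude <= 0.5
            current.append(ints)
        slices.append(current)
    return slices, exps


def emulated_matmul(A, B, num_splits, bits=6):
    """Approximate A @ B using only integer dot products of the slices."""
    Bt = [list(col) for col in zip(*B)]  # columns of B are scaled like rows
    a_slices, a_exp = split_rows(A, num_splits, bits)
    b_slices, b_exp = split_rows(Bt, num_splits, bits)
    m, n = len(A), len(Bt)
    C = [[0.0] * n for _ in range(m)]
    for i in range(num_splits):
        # Assumption: keep only slice pairs with i + j < num_splits and drop
        # the remaining low-order cross terms.
        for j in range(num_splits - i):
            weight = -bits * (i + j + 2)
            for r in range(m):
                for c in range(n):
                    # Exact integer dot product (stands in for an INT8 GEMM).
                    dot = sum(x * y for x, y in zip(a_slices[i][r], b_slices[j][c]))
                    C[r][c] += math.ldexp(float(dot), weight + a_exp[r] + b_exp[c])
    return C


def max_abs_error(C, ref):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for rc, rr in zip(C, ref) for x, y in zip(rc, rr))


if __name__ == "__main__":
    random.seed(0)
    n = 8
    A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    B = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    ref = [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)] for r in range(n)]
    for s in (3, 5, 8):
        print(s, max_abs_error(emulated_matmul(A, B, s), ref))
```

In this sketch each additional split contributes roughly six more bits of each operand, so the error against the native double-precision product shrinks as `num_splits` grows, while the number of integer products grows roughly quadratically. That is the accuracy versus performance trade-off the abstract refers to as tunable precision.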

Paper Structure

This paper contains 9 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Relative error of the real (blue) and imaginary (red) parts of $G(z)$ on the energy contour (black dots) from the first iteration, using fp64_int8_3 and fp64_int8_5