Hardware Trends Impacting Floating-Point Computations In Scientific Applications

Jack Dongarra, John Gunnels, Harun Bayraktar, Azzam Haidar, Dan Ernst

TL;DR

The paper surveys the historical and evolving landscape of floating-point computation in scientific applications, tracing the shift from software emulation to dedicated co-processors, integrated FPUs, and the GPU revolution. It highlights modern trends toward reduced precision and mixed-precision computing driven by AI workloads, and discusses emulation as a flexible approach to extend hardware capabilities while managing energy and performance. Through benchmarks (e.g., HPL, Green500, HPCG, HPL-MxP) and architectural developments in heterogeneous computing and Tensor Cores, the work maps how hardware advances shape software, instruction sets, and paradigms. The key takeaway is that future progress will rely on tightly integrated, energy-efficient, and dynamically adaptable FP architectures that support both AI efficiency and the stringent accuracy needs of scientific computing. This interplay between precision, performance, and power will guide hardware-software co-design and the adoption of non-standard data types and emulation strategies in next-generation scientific and AI systems.

Abstract

The evolution of floating-point computation has been shaped by algorithmic advancements, architectural innovations, and the increasing computational demands of modern technologies, such as artificial intelligence (AI) and high-performance computing (HPC). This paper examines the historical progression of floating-point computation in scientific applications and contextualizes recent trends driven by AI, particularly the adoption of reduced-precision floating-point types. The challenges posed by these trends, including the trade-offs between performance, efficiency, and precision, are discussed, as are innovations in mixed-precision computing and emulation algorithms that offer solutions to these challenges. This paper also explores architectural shifts, including the role of specialized and general-purpose hardware, and how these trends will influence future advancements in scientific computing, energy efficiency, and system design.

Paper Structure

This paper contains 34 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Various floating-point (FP) representations used today in scientific computing and AI. The exponent bits determine the dynamic range of the FP number, while the mantissa bits determine the precision. Four IEEE FP types are shown: half (FP16), single (FP32), double (FP64), and quad (FP128). TensorFloat-32 (TF32), available on NVIDIA GPUs starting with the Ampere architecture, is a Tensor Core matrix multiply compute mode where input and output are FP32, but input operands are truncated. Bfloat16 (BF16), which was introduced by Google [Kalamkar et al., 2019], has the same range as FP32 at the expense of mantissa bits. Two variants of FP8, with different splits of exponent and mantissa bits [Micikevicius et al., 2022], are shown.
  • Figure 2: Historical record of the advances made in transistor density and process efficiency [sjprocessdataieee], contrasted with the increases seen in the TOP500 and Green500 lists since 2013. All series are normalized to 1.0 at the outset of the chart in 2013. It is notable that both the TOP500 and Green500 entries have improved at a far greater rate than process technology and, further, that the TOP500 (performance) has increased at a greater rate than the Green500 (efficiency).
  • Figure 3: Ratio of HPL-MxP (formerly HPL-AI) $R_{max}$ to HPL $R_{max}$ over time, since the inception of the HPL-MxP benchmark. Some top-ranked supercomputers are highlighted with colors and labels. The bubble size is inversely proportional to the TOP500 HPL ranking of that particular supercomputer, which explains why the Summit and Fugaku bubbles shrink over time. In addition to the general trend of the ratio increasing over time for top-ranked systems, implementation optimizations also increase the speed-up over time, as illustrated by the Summit and Frontier systems.
  • Figure 4: Representative power consumption curves measured on an NVIDIA Ampere A100 GPU during the execution of two different equation solvers. The blue line shows the FP64 LU solver (corresponding to ZGETRF & ZGETRS in LAPACK), while the green line shows the Tensor Core accelerated mixed-precision iterative refinement solver available in the cuSOLVER library, which relies on cuBLAS for Level 3 BLAS operations. Both runs use a matrix size of 32000. A comparison with other GPU architectures can be found in Table \ref{tab:mxp_perf}.
  • Figure 5: Comparison of Bytes/FLOP across four generations of GPUs for FMA and Tensor Core throughput from Table \ref{table:hw_specs}. The dashed line shows Tensor Core accelerated DGEMM performance using integer-based emulation with 7 slices (INT8 data storage elements) [Uchino et al., 2024].
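
The Figure 1 caption notes that exponent bits set the dynamic range and mantissa bits set the precision. A minimal Python sketch of that relationship follows. The bit splits in `FORMATS` are the commonly documented ones and are not taken from the text above, and the helper names `machine_epsilon` and `max_finite` are hypothetical. The range formula assumes the IEEE convention of reserving the all-ones exponent for infinities and NaNs, so the FP8 variants (one of which departs from that convention) and FP128 (which overflows a Python float) are left out.

```python
# Assumed (exponent_bits, mantissa_bits) per format; the sign bit is implied.
# These splits are the commonly documented ones, not values from the caption.
FORMATS = {
    "FP16": (5, 10),
    "BF16": (8, 7),
    "TF32": (8, 10),
    "FP32": (8, 23),
    "FP64": (11, 52),
}


def machine_epsilon(mantissa_bits: int) -> float:
    """Gap between 1.0 and the next representable value: 2^-mantissa_bits."""
    return 2.0 ** -mantissa_bits


def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value, assuming an IEEE-style exponent encoding."""
    emax = 2 ** (exponent_bits - 1) - 1  # largest unbiased exponent
    return (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** emax


for name, (e_bits, m_bits) in FORMATS.items():
    eps = machine_epsilon(m_bits)
    largest = max_finite(e_bits, m_bits)
    print(f"{name}: eps = {eps:.3e}, max finite = {largest:.3e}")
```

Under these assumptions, BF16 and FP32 share the same exponent width and therefore nearly the same largest finite value, while BF16 and TF32 differ only in precision. This matches the caption's description of BF16 trading mantissa bits for FP32-like range.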