Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs

Anwar Hossain Zahid, Ignacio Laguna, Wei Le

TL;DR

This paper studies compiler-induced numerical differences between NVIDIA and AMD GPUs, two GPU families widely used in HPC clusters. It finds that some of the differences come from math library calls, differences in floating-point precision (FP64 versus FP32), and converting CUDA code to HIP with HIPIFY.

Abstract

As scientific codes are ported between GPU platforms, continuous testing is required to ensure numerical robustness and identify numerical differences. Compiler-induced numerical differences occur when a program is compiled and run on different GPUs, and the numerical outcomes are different for the same input. We present a study of compiler-induced numerical differences between NVIDIA and AMD GPUs. Our approach uses Varity to generate thousands of short numerical tests in CUDA and HIP, and their inputs; then, we use differential testing to check if the program produced a numerical inconsistency when run on these GPUs. We also use the HIPIFY tool to convert CUDA tests into HIP and check if there are numerical inconsistencies induced by HIPIFY. We generated more than 600,000 tests and found subtle numerical differences that come from (1) math library calls, (2) differences in floating-point precision (FP64 versus FP32), and (3) converting code with HIPIFY.
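
The differential-testing step described above can be illustrated with a minimal sketch. It assumes each test prints a single floating-point result on each platform; the function name `classify_difference` and the category labels are hypothetical, chosen for illustration, and are not taken from Varity or from the paper.

```python
import math


def classify_difference(nvidia_out: float, amd_out: float) -> str:
    """Classify how two results for the same test and input differ.

    Hypothetical helper: the labels below are illustrative only.
    """
    # Two NaNs are treated as agreeing, even though NaN != NaN in IEEE 754.
    if math.isnan(nvidia_out) and math.isnan(amd_out):
        return "consistent"
    # Exactly one platform produced NaN.
    if math.isnan(nvidia_out) or math.isnan(amd_out):
        return "nan-mismatch"
    # At least one infinity: consistent only if both are the same infinity.
    if math.isinf(nvidia_out) or math.isinf(amd_out):
        return "consistent" if nvidia_out == amd_out else "inf-mismatch"
    # Both finite: any inequality counts as a numerical difference.
    # Note that +0.0 and -0.0 compare equal here.
    return "consistent" if nvidia_out == amd_out else "numeric-difference"


if __name__ == "__main__":
    print(classify_difference(0.1 + 0.2, 0.3))
    print(classify_difference(float("inf"), 1e308))
```

A real harness would also need to parse each program's printed output and might compare bit patterns rather than values, for example to distinguish signed zeros; this sketch leaves both out.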

Paper Structure

This paper contains 31 sections, 1 equation, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Overview of testing approach via random program generation for both GPUs (NVIDIA and AMD).
  • Figure 2: Example of a simple random test program in FP64 precision.
  • Figure 3: Process to perform between-platform comparisons.
  • Figure 4: Small Numerical Variation with No Optimization (-O0)
  • Figure 5: Numerical variation with Infinity Without Optimization (-O0)
  • ...and 1 more figure