Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications

Sanjif Shanmugavelu; Mathieu Taillefumier; Christopher Culver; Oscar Hernandez; Mark Coletti; Ada Sedova

TL;DR

We investigate the statistical properties of floating-point non-associativity within modern parallel programming models, and analyze the performance and productivity impacts of replacing atomic operations with deterministic alternatives on GPUs.

Abstract

Run-to-run variability in parallel programs caused by floating-point non-associativity is known to significantly affect reproducibility in iterative algorithms, because errors accumulate. Non-reproducibility can critically reduce the efficiency and effectiveness of correctness testing for stochastic programs. Recently, deep learning training and inference pipelines have been found to be, in some cases, extremely sensitive to floating-point non-associativity. This sensitivity can prevent certification for commercial applications, accurate assessment of robustness and sensitivity, and bug detection. New approaches in scientific computing couple deep learning models with high-performance computing, which aggravates debugging and testing challenges. Here we investigate the statistical properties of floating-point non-associativity within modern parallel programming models, and analyze the performance and productivity impacts of replacing atomic operations with deterministic alternatives on GPUs. We examine the recently added deterministic options in PyTorch in the context of GPU deployment for deep learning, uncovering and quantifying the impacts of input parameters that trigger run-to-run variability and reporting on the reliability and completeness of the documentation. Finally, we evaluate the strategy of exploiting the automatic determinism that deterministic hardware could provide, using the Groq accelerator for the inference portions of the deep learning pipeline. We demonstrate the benefits that a hardware-based strategy can provide for reproducibility and correctness efforts.
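
The effect described in the abstract can be illustrated on a CPU with a minimal Python sketch. This is not the paper's GPU code: the shuffled summation order only stands in for the unordered arrival of atomic additions, `math.fsum` is used as a convenient order-independent reference rather than as one of the deterministic implementations the paper evaluates, and the helper names, array size, run count, and seeds are all illustrative assumptions.

```python
import math
import random

# Floating-point addition is not associative: regrouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def ordered_sum(values, order):
    """Accumulate values left to right in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


def shuffled_orders(n, n_runs, seed):
    """Yield n_runs random permutations of range(n) (hypothetical helper).

    Each permutation emulates one 'run' in which additions arrive in a
    different order, as can happen with atomic adds on a GPU.
    """
    rng = random.Random(seed)
    order = list(range(n))
    for _ in range(n_runs):
        rng.shuffle(order)
        yield list(order)


# Illustrative input: uniform samples on [0, 10); size and seed are assumptions.
data_rng = random.Random(0)
values = [data_rng.uniform(0.0, 10.0) for _ in range(100_000)]

naive_sums = []
exact_sums = []
for order in shuffled_orders(len(values), n_runs=10, seed=1):
    # Order-dependent: the result changes with the summation order.
    naive_sums.append(ordered_sum(values, order))
    # Order-independent: fsum returns the correctly rounded sum.
    exact_sums.append(math.fsum(values[i] for i in order))

print("distinct naive sums:", len(set(naive_sums)))
print("distinct fsum results:", len(set(exact_sums)))
print("largest naive deviation:", max(abs(s - exact_sums[0]) for s in naive_sums))
```

The naive sums differ from run to run only in their last few bits, which is exactly why the discrepancy is easy to miss in a single reduction and yet can grow when the result feeds an iterative algorithm.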

Paper Structure (17 sections, 2 equations, 5 figures, 8 tables)

Introduction
Metrics for measuring the variability of non-deterministic functions
Scalar-valued outputs
Array outputs
Programming deterministic parallel sums
Examples of deterministic parallel sum implementations on GPUs with CUDA/HIP
Other approaches: Parallel sums with OpenMP
Statistical properties of the variability of non-deterministic parallel sums using CUDA or HIP on different GPUs
Performance comparisons
Non-determinism in PyTorch Functions
Case studies of PyTorch kernels
Effect of non-determinism on full deep learning workflows
GraphSAGE Convolution Network
Results
Conclusions
...and 2 more sections

Figures (5)

Figure 1: Probability density of the variability $V_s$ for sums of 1 M numbers sampled from normal and uniform distributions on the V100 GPU. Kernel parameters are $N_t=64$ and $N_b=7813$ for both [...] and [...].
Figure 2: [...] of the scalar variability $V_s$ for 1 M FP64 numbers sampled from the uniform distribution $U(0, 10)$ when [...] is used for the non-deterministic implementation, on the V100 GPU.
Figure 3: Heatmaps of count variability ($V_c$) per run/iteration for 1,000 runs of the non-deterministic implementation of scatter_reduce (left) and index_add (right) for different reduction ratios and input dimensions. Note that the input to index_add is a two-dimensional square array, while the input to scatter_reduce is one-dimensional.
Figure 4: Plot of the count variability for different reduction ratios for the scatter_reduce and index_add PyTorch kernels. For the scatter_reduce kernel we use an array of 2,000 elements, while for the index_add kernel we use an array of 100 elements.
Figure 5: Plot of the tensor variability for different reduction ratios for the scatter_reduce and index_add PyTorch kernels. For the scatter_reduce kernel we use an array of 2,000 elements, while for the index_add kernel we use an array of 100 elements.
