RepDL: Bit-level Reproducible Deep Learning Training and Inference
Peichen Xie, Xian Zhang, Shuo Chen
TL;DR
The paper addresses non-determinism and non-reproducibility in deep learning, which arise from random number generation and from floating-point computation that behaves differently across hardware and software. It introduces RepDL, a PyTorch-compatible library that ensures bitwise reproducibility by enforcing IEEE 754-compliant correct rounding for basic operations and order-invariant computation for reductions and composite functions. Key implementations include correctly rounded basic operations, fixed-order summations for fully connected (FC) and convolution layers, explicit computation graphs that lock the order of operations, and controlled compilation with respect to FMA contraction. The work provides a practical path toward reliable model development and deployment across CPU and GPU environments. The library is released as open source, and planned improvements include performance optimizations and low-precision support.
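To illustrate why summation order matters, here is a minimal pure-Python sketch. It is not RepDL code: `sequential_sum`, `chunked_sum`, `fixed_order_sum`, and `FIXED_NUM_CHUNKS` are hypothetical names chosen for this example, and the chunking only imitates how a parallel reduction might split work across workers.

```python
def sequential_sum(values):
    """Accumulate left to right in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, num_chunks):
    """Sum contiguous chunks separately, then sum the partial results.

    This imitates a reduction whose partitioning depends on the number
    of workers (an assumption made for illustration).
    """
    if not values:
        return 0.0
    size = -(-len(values) // num_chunks)  # ceiling division
    partials = [
        sequential_sum(values[i:i + size])
        for i in range(0, len(values), size)
    ]
    return sequential_sum(partials)


# Hypothetical constant: the partitioning is fixed once and never depends
# on the machine, so the same input always yields the same bits.
FIXED_NUM_CHUNKS = 2


def fixed_order_sum(values):
    """Reduction with a partitioning that does not vary between runs."""
    return chunked_sum(values, FIXED_NUM_CHUNKS)


data = [1e16, 1.0, -1e16, 1.0]

# The same numbers give different results under different partitionings.
by_partition = {n: chunked_sum(data, n) for n in (1, 2, 4)}
print(by_partition)            # {1: 1.0, 2: 0.0, 4: 1.0}

# With the order pinned, the result is identical on every call.
print(fixed_order_sum(data))   # 0.0
```

Pinning the order makes the result reproducible; it does not by itself make the result more accurate, as the `0.0` above shows.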
Abstract
Non-determinism and non-reproducibility present significant challenges in deep learning, leading to inconsistent results across runs and platforms. These issues stem from two origins: random number generation and floating-point computation. While randomness can be controlled through deterministic configurations, floating-point inconsistencies remain largely unresolved. To address this, we introduce RepDL, an open-source library that ensures deterministic and bitwise-reproducible deep learning training and inference across diverse computing environments. RepDL achieves this by enforcing correct rounding and order invariance in floating-point computation. The source code is available at https://github.com/microsoft/RepDL.
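The role of correct rounding can be seen in a small Python sketch, assuming double precision and exact rational arithmetic from the standard `fractions` module. It compares `a * b + c` evaluated with two roundings against the same expression rounded only once, which is what a fused multiply-add (FMA) computes. The helper `fma_exact` is a hypothetical name for this example and is not part of RepDL.

```python
from fractions import Fraction


def fma_exact(a, b, c):
    """Return a*b + c rounded once to the nearest double.

    Fraction(x) converts a float exactly, and float(Fraction) rounds the
    exact rational result to the nearest double, so only one rounding
    happens, as in a hardware FMA.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30        # exactly representable
c = -(1.0 + 2.0 ** -29)     # exactly representable

# Two roundings: the product a*a = 1 + 2^-29 + 2^-60 is rounded to
# 1 + 2^-29 before the addition, so the 2^-60 term is lost.
unfused = a * a + c

# One rounding: the exact value 2^-60 survives.
fused = fma_exact(a, a, c)

print(unfused == 0.0)           # True
print(fused == 2.0 ** -60)      # True
print(fused == unfused)         # False
```

Both results are legitimate floating-point evaluations of the same source expression, so a library that targets bitwise reproducibility has to decide which one is produced instead of leaving the choice to the compiler.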
