An SMT Formalization of Mixed-Precision Matrix Multiplication: Modeling Three Generations of Tensor Cores
Benjamin Valpey, Xinyi Li, Sreepathi Pai, Ganesh Gopalakrishnan
TL;DR
The paper develops a formal SMT-based model of Nvidia tensor cores across Volta, Turing, and Ampere to capture precision, rounding, accumulation order, and carry-out behavior, enabling automatic generation of discriminating inputs and architecture-aware analysis. It corrects prior characterizations (notably, accumulation is FP32 and final FP16 rounding is to nearest) and reveals that Volta/Turing require 3 extra carry-out bits for 5-term accumulation, with Ampere likely needing 4 bits for larger accumulations. By encoding two mixed-precision error-correction schemes (Markidis and Ootomo & Yokota) in SMT, the work demonstrates that the newer scheme is not universally more accurate than the older one, owing to the tensor cores' non-normalized intermediate sums. The approach yields executable hardware-focused models and supports the development of portable simulators and robust algorithm design for future non-standard architectures.
Abstract
Many recent computational accelerators provide non-standard (e.g., reduced precision) arithmetic operations to enhance performance for floating-point matrix multiplication. Unfortunately, the properties of these accelerators are not widely understood, and their behavior is not sufficiently documented. This makes it difficult for tool builders beyond the original vendor to target or simulate the hardware correctly, or for algorithm designers to be confident in their code. To address these gaps, prior studies have probed the behavior of these units with manually crafted tests. Such tests are cumbersome to design, and adapting them as the accelerators evolve requires repeated manual effort. We present a formal model for the tensor cores of Nvidia's Volta, Turing, and Ampere GPUs. We identify specific properties -- rounding mode, precision, and accumulation order -- that drive these cores' behavior. We formalize these properties and then use the formalization to automatically generate discriminating inputs that illustrate differences among machines. Our results confirm many of the findings of previous tensor core studies, but also identify subtle disagreements. In particular, contrary to previous reports, Nvidia's machines do not use round-to-zero for accumulation, and their 5-term accumulator requires 3 extra carry-out bits for full accuracy. Using our formal model, we analyze two existing algorithms that use half-precision tensor cores to accelerate single-precision multiplication with error correction. Our analysis reveals that the newer algorithm, designed to be more accurate than the first, is actually less accurate for certain inputs.
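
The carry-out figure for the 5-term accumulator can be sanity-checked with a simple counting argument, assuming the accumulator adds aligned fixed-point magnitudes of equal width. The sketch below is not the paper's SMT encoding; the helper names `extra_carry_bits` and `worst_case_sum_fits` and the 24-bit width are illustrative assumptions.

```python
def extra_carry_bits(num_terms: int) -> int:
    """Return ceil(log2(num_terms)): enough extra high-order bits so that
    the sum of num_terms magnitudes, each below 2**w, stays below 2**(w + k)."""
    if num_terms < 1:
        raise ValueError("num_terms must be positive")
    return (num_terms - 1).bit_length()


def worst_case_sum_fits(num_terms: int, width: int, extra_bits: int) -> bool:
    """Check whether num_terms maximal width-bit magnitudes fit in
    width + extra_bits bits without overflow."""
    worst = num_terms * ((1 << width) - 1)
    return worst < (1 << (width + extra_bits))


if __name__ == "__main__":
    WIDTH = 24  # hypothetical aligned-significand width, chosen for illustration
    k = extra_carry_bits(5)
    print("extra bits for 5 terms:", k)
    print("fits with k - 1 bits:", worst_case_sum_fits(5, WIDTH, k - 1))
    print("fits with k bits:", worst_case_sum_fits(5, WIDTH, k))
```

Under these assumptions, five maximal terms overflow two extra bits but fit in three, which agrees with the count stated above. The same helper can be applied to other term counts; it bounds only the worst-case magnitude and says nothing about rounding or alignment shifts.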
