V-ABFT: Variance-Based Adaptive Threshold for Fault-Tolerant Matrix Multiplication in Mixed-Precision Deep Learning

Yiheng Gao; Qin Hua; Zizhong Chen

TL;DR

The paper addresses the detection of silent data corruptions in mixed-precision matrix multiplication for deep learning by introducing V-ABFT, a variance-based adaptive threshold method. It models the verification difference directly and uses an extrema-variance bound to estimate variance in O(n) time, yielding thresholds 6 to 48× tighter than A-ABFT while maintaining zero false positives across BF16, FP16, FP32, and FP64. The method supports online fused-kernel ABFT with FP32-level thresholds, enabling roughly 1000× finer detection granularity than offline verification, and is validated on real model weights (LLaMA-7B, GPT-2, ViT) with minimal overhead. The solution is platform-agnostic and demonstrated on both NPUs and GPUs, offering practical fault tolerance for large-scale mixed-precision DL workloads and informing calibration routines for new hardware.

Abstract

Algorithm-Based Fault Tolerance (ABFT) is widely adopted to detect silent data corruptions (SDCs) in matrix multiplication, a cornerstone operation in deep learning systems. However, existing threshold determination methods face critical challenges: analytical bounds are overly conservative, while probabilistic approaches such as A-ABFT yield thresholds $160$--$4200\times$ larger than the actual rounding errors. We present V-ABFT, a variance-based adaptive threshold algorithm that achieves tighter error bounds by directly modeling the verification difference. By leveraging statistical variance estimation, V-ABFT reduces the threshold-to-actual-error ratio to approximately $7$--$20\times$ for FP32/FP64 and $48$--$158\times$ for BF16, a $6$--$48\times$ improvement over A-ABFT, while maintaining a zero false-positive rate across BF16, FP16, FP32, and FP64. Furthermore, we demonstrate that fused-kernel ABFT implementations that verify before output quantization can use FP32-level thresholds ($e_{\max} \approx 10^{-6}$) for low-precision GEMM, enabling $\sim 1000\times$ finer detection granularity than offline verification with low-precision output ($e_{\max} \approx 10^{-3}$). We reproduce A-ABFT's experimental setup and validate our implementation against the original paper's results. Our method requires only $O(n)$ complexity using max/min/mean statistics, compared to A-ABFT's $O(pn)$ for finding the $p$ largest values. Extensive experiments on synthetic data and real model weights (LLaMA-7B, GPT-2, ViT) demonstrate V-ABFT's effectiveness across diverse distributions. V-ABFT is platform-agnostic and has been integrated into fault-tolerant GEMM implementations on both NPUs and GPUs.
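To make the notion of a "verification difference" concrete, the following is a minimal sketch of checksum-based verification for a small GEMM in pure Python. It is not the authors' implementation: the function names, the matrix sizes, and in particular the fixed `threshold` value are assumptions chosen for illustration, and the threshold here is a placeholder constant rather than the V-ABFT formula.

```python
import random

def matmul(A, B):
    """Plain triple-loop GEMM on lists of lists."""
    k, n = len(B), len(B[0])
    return [[sum(row[t] * B[t][j] for t in range(k)) for j in range(n)]
            for row in A]

def verification_differences(A, B, C):
    """Per-column |(e^T A) B - e^T C|, where e is the all-ones vector."""
    k, n = len(B), len(B[0])
    checksum_row = [sum(row[t] for row in A) for t in range(k)]  # e^T A
    expected = [sum(checksum_row[t] * B[t][j] for t in range(k))
                for j in range(n)]                               # (e^T A) B
    observed = [sum(row[j] for row in C) for j in range(n)]      # e^T C
    return [abs(e - o) for e, o in zip(expected, observed)]

def flagged_columns(A, B, C, threshold):
    """Columns whose verification difference exceeds the threshold."""
    diffs = verification_differences(A, B, C)
    return [j for j, d in enumerate(diffs) if d > threshold]

def demo():
    rng = random.Random(0)
    A = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(6)]
    B = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(8)]
    C = matmul(A, B)
    threshold = 1e-9  # hypothetical fixed value, NOT the V-ABFT threshold
    clean = flagged_columns(A, B, C, threshold)
    C[2][3] += 1e-3   # inject a simulated silent data corruption
    faulty = flagged_columns(A, B, C, threshold)
    return clean, faulty
```

In the fault-free case the difference is pure rounding error, so the threshold has to sit above that noise while staying as small as possible; a looser threshold would let smaller corruptions such as the injected `1e-3` go unnoticed, which is the trade-off the abstract describes.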

Paper Structure (60 sections, 1 theorem, 25 equations, 9 tables, 1 algorithm)

Introduction
Background and Motivation
Soft Errors in Deep Learning Systems
ABFT for Matrix Multiplication
Checksum encoding.
Error detection.
Error localization and correction.
Verification difference and threshold.
V-ABFT Algorithm Design
Floating-Point Error Model
Direct Verification Difference Modeling
Statistical Expansion
Physical interpretation.
Threshold Formula Derivation
Efficient Variance Estimation
...and 45 more sections

Key Result

Theorem 1

For any sequence $x_1, x_2, \ldots, x_n$ with mean $\mu$, maximum $m = \max_i x_i$, and minimum $l = \min_i x_i$:
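The statement is cut off at this point in the listing. A classical inequality with exactly these ingredients is the Bhatia-Davis inequality; assuming that is the intended result (an assumption, since the original formula is not shown here), it bounds the population variance $\sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$ as follows, with a one-line justification:

```latex
\sigma^2 \;\le\; (m - \mu)(\mu - l)
% Sketch of why: each term (m - x_i)(x_i - l) is non-negative, so
% 0 \le \frac{1}{n}\sum_{i=1}^{n} (m - x_i)(x_i - l)
%    = m\mu - ml - (\sigma^2 + \mu^2) + l\mu
%    = (m - \mu)(\mu - l) - \sigma^2 .
```

Equality holds when every $x_i$ equals either $m$ or $l$.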

Theorems & Definitions (2)

Theorem 1: Extrema-Variance Bound
proof
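As an illustration of how a variance estimate can be obtained from max/min/mean statistics in a single $O(n)$ pass, the sketch below assumes the extrema-variance bound takes the Bhatia-Davis form, variance at most (max minus mean) times (mean minus min). That form is an assumption, since the theorem statement is truncated in this listing, and the function names are hypothetical.

```python
def extrema_variance_bound(xs):
    """Upper bound on the population variance from one pass over xs.

    Assumes the bound (max - mean) * (mean - min); only the running
    max, min, and sum are tracked, so the cost is O(n).
    """
    lo = hi = xs[0]
    total = 0.0
    for x in xs:
        if x < lo:
            lo = x
        if x > hi:
            hi = x
        total += x
    mu = total / len(xs)
    return (hi - mu) * (mu - lo)

def population_variance(xs):
    """Exact population variance, for comparison with the bound."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)
```

For `[1.0, 2.0, 3.0, 4.0]` the bound is 2.25 against an exact variance of 1.25, while for the two-point sequence `[0.0, 1.0]` both equal 0.25, the case in which the bound is tight.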
