ApproxABFT: Approximate Algorithm-Based Fault Tolerance for Neural Network Processing

Xinghua Xue, Cheng Liu, Feng Min, Tao Luo, Yinhe Han

TL;DR

The paper addresses the high overhead of traditional ABFT for neural networks subject to soft errors, particularly in safety-critical aerospace systems. It introduces ApproxABFT, a two-phase, threshold-based framework that relaxes exact error detection and recovery. The framework is guided by MSD and R/CSD measurements and uses per-layer thresholds tuned by Bayesian search. A block-partitioning extension (Block-ApproxABFT) further improves fault isolation and efficiency. Experiments across multiple architectures show that ApproxABFT widens the effectively protected BER range by an order of magnitude and reduces redundant computing overhead by 43.39% on average, with improved resilience in high-error scenarios. The results suggest practical value for space missions and for large-scale training in noisy environments.

Abstract

With the increasing deployment of deep neural networks (DNNs) in terrestrial and aerospace safety-critical applications, system reliability has emerged as a design metric co-equal with computational efficiency. Algorithm-based fault tolerance (ABFT) mechanisms, which are architecture-agnostic and cost-effective, have become a promising solution for reliability enhancement. However, conventional ABFT approaches rely on rigorous verification in which even minor computational deviations trigger error recovery. This not only disregards the intrinsic fault tolerance of DNN models but also incurs redundant fault-tolerance processing overhead. To address these limitations, we propose an Approximate ABFT framework (ApproxABFT) that introduces adaptive error tolerance thresholds to enable selective fault recovery, activating error correction modules only when computational deviations exceed predefined thresholds. This approach effectively mitigates overreaction to non-critical computational errors. Furthermore, a dynamic block granularity optimization algorithm is implemented to balance error sensitivity across layers. Experimental evaluations demonstrate that the proposed ApproxABFT achieves a 43.39% average reduction in redundant computing overhead compared with previous accurate ABFT, while simultaneously increasing the tolerable soft error rate by an order of magnitude.
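
The difference between exact and threshold-based checking can be illustrated with a minimal sketch. It is not the paper's implementation: it uses the standard checksum identity for a matrix product C = A·B (the row sums of C equal A·(B·1), the column sums equal (1ᵀA)·B) and a single hypothetical `threshold` parameter, whereas the paper tunes thresholds per layer. The function names are also assumptions made for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def checksum_deviations(a, b, c):
    """Compare the sums of c against checksums predicted from a and b.

    Returns (matrix_dev, row_devs, col_devs); all are zero when c == a @ b.
    """
    b_row_sums = [sum(row) for row in b]          # B · 1
    a_col_sums = [sum(col) for col in zip(*a)]    # 1ᵀ · A
    # Expected row sums of C: A · (B · 1)
    exp_rows = [sum(x * s for x, s in zip(row, b_row_sums)) for row in a]
    # Expected column sums of C: (1ᵀ · A) · B
    exp_cols = [sum(s * b[k][j] for k, s in enumerate(a_col_sums))
                for j in range(len(b[0]))]
    row_devs = [abs(sum(row) - e) for row, e in zip(c, exp_rows)]
    col_devs = [abs(sum(col) - e) for col, e in zip(zip(*c), exp_cols)]
    matrix_dev = abs(sum(map(sum, c)) - sum(exp_rows))
    return matrix_dev, row_devs, col_devs


def approx_check(a, b, c, threshold):
    """Hypothetical approximate check: request recovery only when the
    whole-matrix deviation exceeds `threshold`, and report the cells whose
    row and column deviations both exceed it as suspects."""
    matrix_dev, row_devs, col_devs = checksum_deviations(a, b, c)
    if matrix_dev <= threshold:
        return False, []                          # tolerated, no recovery
    rows = [i for i, d in enumerate(row_devs) if d > threshold]
    cols = [j for j, d in enumerate(col_devs) if d > threshold]
    return True, [(i, j) for i in rows for j in cols]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

small = matmul(a, b)
small[0][1] += 1      # minor deviation, e.g. a low-order bit flip
large = matmul(a, b)
large[1][0] += 64     # major deviation, e.g. a high-order bit flip

small_result = approx_check(a, b, small, threshold=2)  # (False, [])
large_result = approx_check(a, b, large, threshold=2)  # (True, [(1, 0)])
```

An exact check corresponds to `threshold=0`, where both injected errors would trigger recovery. With a nonzero threshold only the large deviation does, which is the source of the overhead saving described above. Note that a whole-matrix sum can miss errors of opposite sign that cancel each other; this sketch does not handle that case.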

Paper Structure

This paper contains 16 sections, 5 equations, 12 figures, 3 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Correlation between accuracy and mean deviation of MSD and R/CSD.
  • Figure 2: MSD and R/CSD of the largest output matrix in each VGG19 layer.
  • Figure 3: The percentage of rows and columns that cannot be recovered by accurate ABFT due to multiple computing errors in VGG19.
  • Figure 4: The architecture of ApproxABFT.
  • Figure 5: Accuracy and computing overhead comparison between ApproxABFT, Block-ApproxABFT, AccurateABFT, and Block-AccurateABFT.
  • ...and 7 more figures