Adaptive Soft Error Protection for Neural Network Processing

Xinghua Xue, Cheng Liu, Feng Min, Yinhe Han

TL;DR

This paper addresses soft errors in neural network processing by recognizing that vulnerability varies with the input and runtime context, not only across network components. It introduces a lightweight graph neural network (GNN) that predicts per-input, per-layer vulnerability and uses these predictions to drive adaptive protection that complements traditional fault-tolerant methods such as ABFT and TMR. The approach reduces computational overhead by 42.12% on average relative to prior static vulnerability-based approaches while preserving reliability, as demonstrated across models including LeNet, AlexNet variants, and YOLO and datasets including UC Merced and Caltech101. The work offers a practical path to cost-effective, runtime-aware fault tolerance for neural network accelerators.

Abstract

Mitigating soft errors in neural networks (NNs) often incurs significant computational overhead. Traditional methods mainly explored static vulnerability variations across NN components, employing selective protection to minimize costs. In contrast, this work reveals that NN vulnerability is also input-dependent, exhibiting dynamic variations at runtime. To this end, we propose a lightweight graph neural network (GNN) model capable of capturing input- and component-specific vulnerability to soft errors. This model facilitates runtime vulnerability prediction, enabling an adaptive protection strategy that dynamically adjusts to varying vulnerabilities. The approach complements classical fault-tolerant techniques by tailoring protection efforts based on real-time vulnerability assessments. Experimental results across diverse datasets and NNs demonstrate that our adaptive protection method achieves a 42.12% average reduction in computational overhead compared to prior static vulnerability-based approaches, without compromising reliability.
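
The adaptive strategy described above can be illustrated with a minimal sketch. It assumes that a vulnerability predictor (the GNN in the paper) has already produced one score per layer for a given input, and that a layer receives intensive protection only when its score exceeds a threshold. The layer names, scores, per-layer costs, and threshold below are hypothetical placeholders, not values or rules taken from the paper.

```python
from typing import Dict, List


def select_layers_to_protect(predicted_vulnerability: Dict[str, float],
                             threshold: float) -> List[str]:
    """Return the layers whose predicted vulnerability exceeds the threshold."""
    return [name for name, score in predicted_vulnerability.items()
            if score > threshold]


def protection_cost(protected: List[str], layer_cost: Dict[str, float]) -> float:
    """Sum the (hypothetical) protection cost of the selected layers."""
    return sum(layer_cost[name] for name in protected)


# Hypothetical per-input predictions standing in for the GNN output.
easy_input = {"conv1": 0.05, "conv2": 0.10, "fc1": 0.02}
hard_input = {"conv1": 0.40, "conv2": 0.35, "fc1": 0.08}

# Hypothetical relative cost of protecting each layer (e.g. with ABFT or TMR).
layer_cost = {"conv1": 1.0, "conv2": 2.0, "fc1": 0.5}

THRESHOLD = 0.2  # assumed decision threshold, chosen only for illustration

for label, scores in (("easy", easy_input), ("hard", hard_input)):
    chosen = select_layers_to_protect(scores, THRESHOLD)
    print(label, chosen, protection_cost(chosen, layer_cost))
```

In this toy setup the "easy" input triggers no intensive protection, while the "hard" input protects two of the three layers. A static scheme would pay the same protection cost for both inputs, which is the kind of overhead the input-dependent approach aims to avoid.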

Paper Structure (12 sections, 9 figures, 1 table)

This paper contains 12 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Vulnerability variations across different inputs.
  • Figure 2: The proposed adaptive fault-tolerant design framework. It leverages a GNN model to predict the NN vulnerability to soft errors. The prediction is further utilized to decide if an NN layer is vulnerable to errors and requires intensive protection at runtime.
  • Figure 3: An example of graph representation.
  • Figure 4: Model accuracy comparison between different fault-tolerant design strategies in presence of various fault injection setups.
  • Figure 5: Fault-tolerant design overhead comparison between different fault-tolerant design strategies in presence of various fault injection setups.
  • ...and 4 more figures