Adaptive Soft Error Protection for Neural Network Processing
Xinghua Xue, Cheng Liu, Feng Min, Yinhe Han
TL;DR
This paper addresses soft errors in neural network processing by recognizing that vulnerability varies with the input and runtime context, not only with the network component. It introduces a lightweight graph neural network (GNN) that predicts per-input and per-layer vulnerability, and uses these predictions to drive adaptive protection that complements traditional fault-tolerant methods such as ABFT and TMR. Compared to prior static vulnerability-based approaches, the method reduces computational overhead by 42.12% on average while preserving reliability, as demonstrated on models and datasets including LeNet, AlexNet variants, UC Merced, Caltech101, and YOLO. The work offers a practical path to cost-effective, runtime-aware fault tolerance for neural network accelerators.
Abstract
Mitigating soft errors in neural networks (NNs) often incurs significant computational overhead. Prior methods mainly exploit static vulnerability variations across NN components, applying selective protection to minimize cost. In contrast, this work reveals that NN vulnerability is also input-dependent and varies dynamically at runtime. To exploit this, we propose a lightweight graph neural network (GNN) model that captures input- and component-specific vulnerability to soft errors. The model enables runtime vulnerability prediction, which in turn drives an adaptive protection strategy that adjusts dynamically to the predicted vulnerability. The approach complements classical fault-tolerant techniques by tailoring the protection effort to real-time vulnerability assessments. Experimental results across diverse datasets and NNs demonstrate that our adaptive protection method achieves a 42.12% average reduction in computational overhead compared to prior static vulnerability-based approaches, without compromising reliability.
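To make the idea of vulnerability-driven adaptive protection concrete, the following is a minimal sketch, not the authors' implementation. It assumes per-layer vulnerability scores in [0, 1] are already available for the current input (in the paper these would come from the GNN predictor; here they are hard-coded). The thresholds, the three protection levels, the relative cost factors, and the names `choose_protection` and `plan_protection` are all hypothetical and chosen only for illustration.

```python
# Illustrative relative compute cost of each protection level
# (1.0 = unprotected). These factors are assumptions, not values from the paper.
COST_FACTOR = {"none": 1.0, "abft": 1.2, "tmr": 3.0}


def choose_protection(score, low=0.3, high=0.7):
    """Map a predicted vulnerability score in [0, 1] to a protection level.

    The thresholds `low` and `high` are hypothetical tuning knobs.
    """
    if score >= high:
        return "tmr"    # most vulnerable: triple modular redundancy
    if score >= low:
        return "abft"   # moderately vulnerable: checksum-based protection
    return "none"       # predicted resilient for this input: skip protection


def plan_protection(scores, layer_costs):
    """Pick a level per layer and return (levels, total relative cost)."""
    levels = [choose_protection(s) for s in scores]
    total = sum(cost * COST_FACTOR[level]
                for cost, level in zip(layer_costs, levels))
    return levels, total


# Stand-in for per-input, per-layer predictions from a vulnerability model.
predicted_scores = [0.1, 0.5, 0.9]
layer_costs = [1.0, 2.0, 1.0]  # hypothetical unprotected cost per layer

levels, adaptive_cost = plan_protection(predicted_scores, layer_costs)

# A static policy that protects every layer at the strongest level,
# regardless of the input, serves as a simple point of comparison.
static_cost = sum(cost * COST_FACTOR["tmr"] for cost in layer_costs)

print(levels, adaptive_cost, static_cost)
```

Under these assumed numbers, the adaptive plan spends the expensive protection only on the layer predicted to be most vulnerable for this particular input, which is the source of the overhead savings the abstract describes. The actual savings depend on the predictor's accuracy and on the real cost of each protection scheme.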
