Mitigating multiple single-event upsets during deep neural network inference using fault-aware training

Toon Vinck, Naïn Jonckers, Gert Dekkers, Jeffrey Prinzie, Peter Karsmakers

TL;DR

This paper studies the reliability of deep neural network inference under multiple single-event upsets by injecting faults into a quantised DNN model and evaluating a fault-aware training (FAT) approach. It presents a PyTorch-based fault injector that simulates bit-flips in the data path during inference, and tests it on CCDF and MobileNetV2 with the MNIST and CIFAR10 datasets. The results show that robustness degrades as the number of faults grows, but that FAT can increase fault tolerance by up to a factor of 3. The 32-bit modules remain the most sensitive and are treated as effectively hardware-protected during FAT. The work demonstrates a practical, software-based mitigation that can significantly improve fault tolerance without changing the hardware, aiding safe deployment in harsh environments.

Abstract

Deep neural networks (DNNs) are increasingly used in safety-critical applications. Reliable fault analysis and mitigation are essential to ensure their functionality in harsh environments with high radiation levels. This study analyses the impact of multiple single-bit single-event upsets on DNNs by performing fault injection at the level of a DNN model. Additionally, a fault-aware training (FAT) methodology is proposed that improves the DNNs' robustness to faults without any modification to the hardware. Experimental results show that the FAT methodology improves the tolerance to faults by up to a factor of 3.
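
The model-level fault injection described above can be illustrated with a minimal sketch. It is written in plain Python rather than PyTorch and is not the authors' tool: the function names `flip_bit` and `inject_faults`, the 8-bit two's-complement representation, and the uniform random choice of element and bit position are all assumptions made for illustration.

```python
import random


def flip_bit(value: int, bit: int, width: int = 8) -> int:
    """Flip one bit of a signed two's-complement integer of the given width."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    # Work on the unsigned bit pattern, then toggle the selected bit.
    raw = (value & mask) ^ (1 << bit)
    # Map the unsigned pattern back to a signed value.
    if raw >= 1 << (width - 1):
        raw -= 1 << width
    return raw


def inject_faults(values, n_faults, width=8, rng=None):
    """Return a copy of `values` with `n_faults` single-bit upsets applied.

    Assumption: each upset picks an element and a bit position uniformly
    at random, independently of the other upsets.
    """
    rng = rng or random.Random()
    out = list(values)
    for _ in range(n_faults):
        idx = rng.randrange(len(out))
        bit = rng.randrange(width)
        out[idx] = flip_bit(out[idx], bit, width)
    return out


if __name__ == "__main__":
    activations = [12, -7, 0, 127, -128, 33]  # hypothetical int8 activations
    print(inject_faults(activations, n_faults=3, rng=random.Random(0)))
```

Because each fault picks its target independently, two faults can land on the same bit and cancel out; a stricter sketch would sample distinct (element, bit) pairs.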

Paper Structure

This paper contains 5 sections, 2 equations, and 4 figures.

Figures (4)

  • Figure 1: Representation of the tool applied to one quantised DNN layer [vinck2024understanding].
  • Figure 2: Relation between accuracy and the number of faults and fault rate, for the CCDF and MobileNetV2 experiments. The error bands represent a 95% confidence interval.
  • Figure 3: Relation between accuracy and fault rate for the CCDF and MobileNetV2 experiments when faults are injected into only one module at a time. The error bars represent a 95% confidence interval.
  • Figure 4: Comparison between the robustness of the model and that of the reference model. The error bands represent a 95% confidence interval.
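
The 95% confidence intervals mentioned in the captions can be pictured with a short sketch. The source does not say how the intervals were computed, so the normal approximation (mean ± 1.96 standard errors), the function name `mean_ci95`, and the sample accuracies below are assumptions made for illustration only.

```python
import math
import statistics


def mean_ci95(samples):
    """Return (mean, lower, upper) for a normal-approximation 95% interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Hypothetical accuracies from repeated fault-injection runs at one fault rate.
    accuracies = [0.91, 0.88, 0.93, 0.90, 0.89]
    print(mean_ci95(accuracies))
```

With few repetitions per fault rate, a Student's t multiplier would give a wider and more honest interval than the fixed 1.96 used here.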