Mitigating multiple single-event upsets during deep neural network inference using fault-aware training
Toon Vinck, Naïn Jonckers, Gert Dekkers, Jeffrey Prinzie, Peter Karsmakers
TL;DR
This paper studies the reliability of deep neural network inference under multiple single-event upsets by injecting faults into a quantised DNN model and evaluating a fault-aware training (FAT) approach. It presents a PyTorch-based fault injector that simulates bit-flips in the data path during inference, and tests it on CCDF and MobileNetV2 with the MNIST and CIFAR10 datasets. The results show that robustness degrades as the number of faults grows, but that FAT can increase fault tolerance by up to 3×. The 32-bit modules remain the most sensitive and are treated as effectively hardware-protected during FAT. The work demonstrates a practical, software-based mitigation that can significantly improve fault tolerance without changing hardware, aiding safe deployment in harsh environments.
Abstract
Deep neural networks (DNNs) are increasingly used in safety-critical applications. Reliable fault analysis and mitigation are essential to ensure their functionality in harsh environments with high radiation levels. This study analyses the impact of multiple single-bit single-event upsets in DNNs by performing fault injection at the level of the DNN model. Additionally, a fault-aware training (FAT) methodology is proposed that improves the DNNs' robustness to faults without any modification to the hardware. Experimental results show that the FAT methodology improves fault tolerance by up to a factor of 3.
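To make "multiple single-bit upsets" concrete, the sketch below flips a chosen number of distinct bits in a list of quantised values. It is an illustration only, not the authors' PyTorch injector: the 8-bit signed (two's complement) word size and the helper names `flip_bit_int8` and `inject_upsets` are assumptions introduced here, and plain Python lists stand in for tensors.

```python
import random


def flip_bit_int8(value, bit):
    """Flip one bit of a signed 8-bit integer (two's complement).

    Assumption: values are int8-quantised; the source does not state the word size.
    """
    if not -128 <= value <= 127:
        raise ValueError("value must fit in a signed 8-bit integer")
    if not 0 <= bit <= 7:
        raise ValueError("bit index must be in 0..7")
    raw = value & 0xFF           # reinterpret as an unsigned byte
    raw ^= 1 << bit              # the single-event upset: invert one bit
    return raw - 256 if raw >= 128 else raw  # back to signed


def inject_upsets(values, n_faults, rng):
    """Return a copy of `values` with `n_faults` distinct single-bit flips.

    Hypothetical helper: fault positions are drawn uniformly without
    replacement over all (element, bit) pairs, so no flip cancels another.
    """
    total_bits = len(values) * 8
    if n_faults > total_bits:
        raise ValueError("more faults requested than bits available")
    faulty = list(values)
    for position in rng.sample(range(total_bits), n_faults):
        index, bit = divmod(position, 8)
        faulty[index] = flip_bit_int8(faulty[index], bit)
    return faulty


if __name__ == "__main__":
    rng = random.Random(0)       # fixed seed for a repeatable experiment
    activations = [12, -7, 0, 127, -128, 45]
    print(inject_upsets(activations, n_faults=3, rng=rng))
```

A flip in the most significant bit changes a value by 128, while a flip in the least significant bit changes it by 1, which is one reason the impact of a fault depends strongly on where it lands.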
