Analysis of Single Event Induced Bit Faults in a Deep Neural Network Accelerator Pipeline

Naïn Jonckers, Toon Vinck, Peter Karsmakers, Jeffrey Prinzie

TL;DR

This paper analyzes radiation-induced single-bit faults in a 2×2 systolic-array DNN (SA-DNN) accelerator, using RTL fault injection to observe how faults propagate through the pipeline. It introduces metrics for non-critical and critical errors and evaluates three DNN workloads (3L-fc MNIST, LeNet MNIST, LeNet CIFAR-10), revealing that 32-bit and accumulator registers are the primary conduits for fault propagation, while model redundancy and higher training accuracy mitigate risk. The CIFAR-10 workload, which has the lowest accuracy, exhibits the highest susceptibility, underscoring the interplay between hardware faults and model robustness. The study concludes with hardware-software co-design strategies, combining lightweight fault checks and fault-aware training to achieve efficient, fault-tolerant SA-DNN operation in radiation-prone environments.

Abstract

In recent years, the increased interest in and the growing number of application domains for Artificial Intelligence (AI), and more specifically Deep Neural Networks (DNNs), have led to extensive use of domain-specific DNN accelerator processors to improve the computational efficiency of DNN inference. However, like any digital circuit, these processors are prone to faults induced by radiation particles such as heavy ions and protons, making their use in harsh radiation environments a challenge. This work presents an in-depth analysis of the impact of such faults on the computational pipeline of a Systolic Array-based Deep Neural Network accelerator (SA-DNN accelerator) by means of a Register Transfer Level (RTL) Fault Injection (FI) simulation, chosen to improve the observability of each hardware block. From this analysis, we present the sensitivity to single-bit faults of register groups in the pipeline for three different DNN workloads utilising two datasets, namely MNIST and CIFAR-10. These sensitivity figures are presented in terms of Fault Propagation Probability ($P(f_{\text{non-crit}})$) and False Classification Probability ($P(f_{\text{crit}})$), which respectively give the probability that an injected fault causes a non-critical error (numerical offset) or a critical error (classification fault). From these results, we devise a fault mitigation strategy to harden the SA-DNN accelerator efficiently in terms of both area and power overhead.
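The two sensitivity figures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' RTL simulation framework: the helper names `flip_bit`, `classify_outcome`, and `estimate_probabilities` are hypothetical, the output vectors are made-up toy values, and the sketch assumes that each injection is labelled with exactly one of three outcomes (masked, non-critical, critical) and that each probability is estimated as the fraction of injections with that label.

```python
from collections import Counter


def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with a single bit inverted, truncated to `width` bits."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside register width")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def classify_outcome(golden, faulty) -> str:
    """Label one injection by comparing fault-free and faulty output vectors.

    Assumption: the predicted class is the index of the largest output value.
    """
    if list(golden) == list(faulty):
        return "masked"  # the fault never reached the output
    golden_class = max(range(len(golden)), key=golden.__getitem__)
    faulty_class = max(range(len(faulty)), key=faulty.__getitem__)
    # A changed prediction is a critical error; otherwise only a numerical offset.
    return "critical" if golden_class != faulty_class else "non_critical"


def estimate_probabilities(labels) -> dict:
    """Estimate both probabilities as fractions of all injections (assumption)."""
    if not labels:
        raise ValueError("no injections recorded")
    counts = Counter(labels)
    total = len(labels)
    return {
        "P(f_non-crit)": counts["non_critical"] / total,
        "P(f_crit)": counts["critical"] / total,
    }


# Toy data: one fault-free output vector and four hypothetical faulty runs.
golden = [3, 10, 7]
faulty_runs = [[3, 10, 7], [3, 11, 7], [3, 10, 12], [3, 10, 7]]
labels = [classify_outcome(golden, run) for run in faulty_runs]
print(labels)
print(estimate_probabilities(labels))
```

In an actual campaign the faulty output vectors would come from the RTL simulation after a bit is flipped in a chosen register, and the estimates would be reported per register group and per workload.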

Paper Structure

This paper contains 5 sections, 2 equations, 3 figures.

Figures (3)

  • Figure 1: Block diagram of the pipeline for a 2$\times$2 parametrisation
  • Figure 2: Simulation framework
  • Figure 3: DNN test results for: (a) 3-layer fc network with MNIST, (b) LeNet with MNIST, and (c) LeNet with CIFAR-10