DeepVigor+: Scalable and Accurate Semi-Analytical Fault Resilience Analysis for Deep Neural Network

Mohammad Hasan Ahmadilivani, Jaan Raik, Masoud Daneshtalab, Maksim Jenihhin

TL;DR

DeepVigor+ addresses the scalability gap in fault resilience analysis for CNNs with a semi-analytical framework that computes Vulnerability Factors (VFs) accurately while drastically reducing the number of simulations. It combines a theoretically grounded fault propagation model for 32-bit floating-point CNNs with a stratified sampling strategy that scales to large architectures, so that the Model Vulnerability Factor (MVF) can be computed in minutes. The approach keeps VF estimation error below 1% and requires 14.9x to 26.9x fewer simulations than the best-known state-of-the-art SFI, and it is faster than prior semi-analytical methods while preserving accuracy. By analyzing both activations and weights and visualizing layer-level vulnerability, DeepVigor+ supports rapid reliability assessment, design-space exploration, and potential fault-tolerant optimizations for safety-critical DL deployments. It is released as open source.

Abstract

The growing exploitation of Machine Learning (ML) in safety-critical applications necessitates rigorous safety analysis. Hardware reliability assessment is a major concern with respect to measuring the level of safety in ML-based systems. Quantifying the reliability of emerging ML models, including Convolutional Neural Networks (CNNs), is highly complex due to their enormous size in terms of the number of parameters and computations. Conventionally, Fault Injection (FI) is applied to perform a reliability measurement. However, performing FI on modern-day CNNs is prohibitively time-consuming if an acceptable confidence level is to be achieved. To speed up FI for large CNNs, statistical FI (SFI) has been proposed, but its runtimes are still considerably long. In this work, we introduce DeepVigor+, a scalable, fast, and accurate semi-analytical method as an efficient alternative for reliability measurement in CNNs. DeepVigor+ implements a fault propagation analysis model and attempts to acquire Vulnerability Factors (VFs) as reliability metrics in an optimal way. The results indicate that DeepVigor+ obtains VFs for CNN models with an error less than $1\%$, i.e., the objective in SFI, but with $14.9$ up to $26.9$ times fewer simulations than the best-known state-of-the-art SFI. DeepVigor+ enables an accurate reliability analysis for large and deep CNNs within a few minutes, rather than achieving the same results in days or weeks.
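To give a feel for why SFI remains expensive at a 1% error target, the sketch below computes a fault-injection sample size with the finite-population formula commonly used in the statistical FI literature. The abstract does not state which formula or confidence level the compared SFI baseline uses, so the 99% confidence level (z = 2.576), the worst-case p = 0.5, the function name `sfi_sample_size`, and the example model size are all assumptions made for illustration.

```python
import math


def sfi_sample_size(population: int,
                    error_margin: float = 0.01,
                    confidence_z: float = 2.576,
                    p: float = 0.5) -> int:
    """Faults to inject so the estimated failure rate is within `error_margin`.

    Finite-population sample size:
        n = N / (1 + e^2 * (N - 1) / (z^2 * p * (1 - p)))
    where N is the number of possible faults, e the error margin,
    z the normal quantile for the confidence level, and p the assumed
    failure probability (0.5 is the conservative worst case).
    """
    if population < 1:
        raise ValueError("population must be at least 1")
    denominator = 1 + (error_margin ** 2) * (population - 1) / (
        confidence_z ** 2 * p * (1 - p)
    )
    return math.ceil(population / denominator)


if __name__ == "__main__":
    # Hypothetical model: 25 million float32 weights, one fault site per bit.
    fault_sites = 25_000_000 * 32
    print("faults to inject (whole model):", sfi_sample_size(fault_sites))
    # Tightening the error margin raises the cost roughly quadratically.
    print("at 0.1% error:", sfi_sample_size(fault_sites, error_margin=0.001))
```

Each sampled fault costs at least one inference pass over the evaluation data, and finer-grained analyses (per layer or per bit position) repeat the sampling for every subpopulation, which is where the simulation count grows.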

Paper Structure

This paper contains 22 sections, 13 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: A real-world example of an autonomous vehicle failure, leading to a fatal accident, source: https://shorturl.at/QnOQq.
  • Figure 2: Growing size of emerging DNN models in terms of computation and memory requirements [yuan2021tokens].
  • Figure 3: 32-bit floating point IEEE-754 data representation.
  • Figure 4: Fault propagation in the case of a single bit flip in a filter.
  • Figure 5: An overview of the DeepVigor+ approach for resilience analysis of CNNs.
  • ...and 5 more figures
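Figures 3 and 4 concern single bit flips in IEEE-754 32-bit floating-point values. As a rough illustration of that fault model (not code from DeepVigor+), the sketch below flips one bit of a float32 value using only the Python standard library. The helper name `flip_bit` and the bit numbering (0 = least significant mantissa bit, 31 = sign bit) are choices made for this example.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` (stored as IEEE-754 float32) with one bit inverted.

    Bit layout: 31 = sign, 30..23 = exponent, 22..0 = mantissa.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask inverts exactly the selected bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    print(flip_bit(1.0, 31))  # sign bit: -1.0
    print(flip_bit(1.0, 23))  # lowest exponent bit: 0.5
    print(flip_bit(1.0, 30))  # highest exponent bit: inf
    print(flip_bit(1.0, 0))   # lowest mantissa bit: barely above 1.0
```

The spread of outcomes shows why bit position matters: a flip in a high exponent bit can change a weight by many orders of magnitude, while a low mantissa flip is nearly invisible.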