
SAFFIRA: a Framework for Assessing the Reliability of Systolic-Array-Based DNN Accelerators

Mahdi Taheri, Masoud Daneshtalab, Jaan Raik, Maksim Jenihhin, Salvatore Pappalardo, Paul Jimenez, Bastien Deveautour, Alberto Bosio

TL;DR

SAFFIRA tackles reliability assessment for systolic-array (SA) DNN accelerators by introducing a hierarchical, software-based, hardware-aware fault-injection (FI) flow that models the SA core with Uniform Recurrent Equations (UREs). The method supports multiple data representations and introduces a novel faulty-distance metric to quantify resilience; it is implemented as an open-source tool with PyTorch integration. Empirical evaluation on LeNet-5 and larger CNNs shows FI-time reductions of up to 3x versus hybrid FI and up to 2000x versus RTL-level FI while preserving accuracy, which makes the approach practical for safety-critical deployments. Beyond the tool itself, the work contributes a formalization of fault propagation, a versatile data-path model, and a path toward broader hardware-system integration.

Abstract

The systolic array has emerged as a prominent architecture for Deep Neural Network (DNN) hardware accelerators, providing the high-throughput, low-latency performance essential for deploying DNNs across diverse applications. However, when such accelerators are used in safety-critical applications, reliability assessment is mandatory to guarantee their correct behavior. While fault injection stands out as a well-established, practical, and robust method for reliability assessment, it is still a very time-consuming process. This paper addresses the time-efficiency issue by introducing a novel hierarchical software-based, hardware-aware fault injection strategy tailored for systolic-array-based DNN accelerators.

Paper Structure (16 sections, 7 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: DNN accelerator hardware reliability threats [taheri2024exploration]
  • Figure 2: SAFFIRA methodology
  • Figure 3: LoLif example. Applied transformations are similar to im2col and im2row.
  • Figure 4: When a fault is injected into element $s$, it propagates in time (affecting elements $s + \delta t_i$ and $s + 2 \delta t_i$) and in space (the faulty value is forwarded to neighboring elements $s + \delta x_i + \delta t_i$, $s + 2\delta x_i + \delta t_i$, and so on).
  • Figure 5: Histogram of the faulty-distance values
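
The propagation pattern described in the Figure 4 caption can be illustrated with a small sketch. It assumes a one-dimensional row of processing elements, lattice points written as `(x, t)` (position, clock cycle), and unit dependency vectors `dx = (1, 0)` and `dt = (0, 1)`. The function name `affected_points` and its parameters are hypothetical and are not taken from the SAFFIRA tool; the sketch only enumerates the points the caption lists and is not the paper's URE model.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, t): PE index along one axis, clock cycle


def shift(p: Point, v: Point, k: int = 1) -> Point:
    """Return p + k * v on the (x, t) lattice."""
    return (p[0] + k * v[0], p[1] + k * v[1])


def affected_points(
    s: Point, dx: Point, dt: Point, n_pe: int, n_cycles: int
) -> Tuple[List[Point], List[Point]]:
    """Enumerate the elements reached by a fault injected at s.

    Follows the caption literally (an assumption about the model):
      - in time:  s + k*dt            for k = 1, 2, ...
      - in space: s + j*dx + dt       for j = 1, 2, ...
    Enumeration stops at the array boundary or the time horizon.
    """
    if dx == (0, 0) or dt == (0, 0):
        raise ValueError("dependency vectors must be non-zero")

    def in_bounds(p: Point) -> bool:
        return 0 <= p[0] < n_pe and 0 <= p[1] < n_cycles

    # Temporal propagation: the same element in later cycles.
    in_time: List[Point] = []
    k = 1
    while in_bounds(shift(s, dt, k)):
        in_time.append(shift(s, dt, k))
        k += 1

    # Spatial propagation: neighbors that receive the forwarded value.
    in_space: List[Point] = []
    base = shift(s, dt)
    j = 1
    while in_bounds(shift(base, dx, j)):
        in_space.append(shift(base, dx, j))
        j += 1

    return in_time, in_space


# Fault injected at PE 1 in cycle 2, on a 4-PE row observed for 5 cycles.
in_time, in_space = affected_points(
    s=(1, 2), dx=(1, 0), dt=(0, 1), n_pe=4, n_cycles=5
)
```

With these assumed vectors, `in_time` holds the later cycles of the injected element and `in_space` holds the neighbors reached one time step after injection. Other dataflows would use different dependency vectors.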