Characterizing Soft-Error Resiliency in Arm's Ethos-U55 Embedded Machine Learning Accelerator

Abhishek Tyagi; Reiley Jeyapaul; Chuteng Zhu; Paul Whatmough; Yuhao Zhu

TL;DR

This work evaluates the soft-error resiliency of Arm's Ethos-U55 embedded ML accelerator, using RTL fault injection to quantify the per-inference silent data corruption (SDC) rate against ASIL requirements. It introduces a practical modeling and Monte Carlo methodology to estimate SDC_NPU, and analyzes how resiliency varies with MAC size, workload, and technology node, including logic faults. The key finding is that ASIL-D can be met with selective block-level protection rather than full duplication: duplicating the TSU and WD blocks (DMR) and hardening the flip-flops elsewhere meets ASIL-D with about 38–53% area overhead, depending on node and protection mix. For safety-critical deployments, this lets designers balance area, power, and reliability by targeting the critical blocks and mixing protection schemes.

Abstract

As Neural Processing Units (NPUs), or accelerators, are increasingly deployed in a variety of applications, including safety-critical ones such as autonomous vehicles and medical imaging, it is critical to understand their fault tolerance. We present a reliability study of Arm's Ethos-U55, an important industrial-scale NPU used in embedded and IoT applications. We perform large-scale RTL fault injections to characterize Ethos-U55 against the Automotive Safety Integrity Level D (ASIL-D) resiliency standard commonly used for safety-critical applications such as autonomous vehicles. We show that, under soft errors, all four configurations of the NPU fall short of the required level of resiliency for a variety of neural networks running on the NPU. We show that it is possible to meet ASIL-D resiliency without resorting to conventional strategies like Dual Core Lock Step (DCLS), which has an area overhead of 100%. We achieve this through selective protection, where hardware structures are selectively protected (e.g., duplicated or hardened) based on their sensitivity to soft errors and their silicon area. The main challenge in identifying the configuration that minimizes area overhead while meeting the ASIL-D standard is the large search space combined with time-consuming RTL simulation. To address this challenge, we present a statistical analysis tool that is validated against Arm silicon and that lets us quickly navigate hundreds of billions of fault sites without exhaustive RTL fault injections. We show that carefully duplicating a small fraction of the functional blocks and hardening the flops in the other blocks meets the ASIL-D safety standard with an area overhead of only 38%.
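The selective-protection idea in the abstract can be read as a small constrained search: pick one protection option per functional block so that the residual SDC rate meets a target while the added area is as small as possible. The sketch below illustrates only that framing. It is not the paper's statistical analysis tool, and every block name, area fraction, SDC share, overhead factor, and target in it is a hypothetical placeholder rather than a measured Ethos-U55 value. It uses exhaustive enumeration, which is only practical for a handful of blocks.

```python
from itertools import product

# HYPOTHETICAL per-block data: (fraction of NPU area, share of unprotected SDC).
# These are placeholders for illustration, not Ethos-U55 measurements.
BLOCKS = {
    "block_a": (0.10, 0.50),
    "block_b": (0.20, 0.30),
    "block_c": (0.70, 0.20),
}

# HYPOTHETICAL protection options:
# (area overhead as a fraction of the block's area, fraction of the block's SDC that remains).
OPTIONS = {
    "none": (0.00, 1.0),
    "harden": (0.25, 0.1),
    "dmr": (1.00, 0.0),
}


def cheapest_plan(blocks, options, baseline_sdc, target_sdc):
    """Return (area_overhead, residual_sdc, plan) for the cheapest plan that
    meets target_sdc, or None if no combination of options meets it."""
    names = list(blocks)
    best = None
    # Enumerate every assignment of one protection option to each block.
    for choice in product(options, repeat=len(names)):
        area = sum(blocks[n][0] * options[c][0] for n, c in zip(names, choice))
        sdc = baseline_sdc * sum(
            blocks[n][1] * options[c][1] for n, c in zip(names, choice)
        )
        if sdc <= target_sdc and (best is None or area < best[0]):
            best = (area, sdc, dict(zip(names, choice)))
    return best


if __name__ == "__main__":
    result = cheapest_plan(BLOCKS, OPTIONS, baseline_sdc=1.0, target_sdc=0.06)
    if result is None:
        print("no plan meets the target")
    else:
        area, sdc, plan = result
        print(f"area overhead: {area:.3f}, residual SDC: {sdc:.3f}")
        for name, option in plan.items():
            print(f"  {name}: {option}")
```

With these placeholder inputs, the search duplicates the small block that dominates SDC and hardens the rest, which mirrors the qualitative pattern the abstract describes. The cost of enumeration grows exponentially with the number of blocks, so a real design-space exploration would need a smarter search.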

Paper Structure (36 sections, 18 equations, 9 figures, 4 tables)

Introduction
Background
Scope and Assumptions
Soft-Errors
Ethos-U55 Overview
Existing Soft-Error Resilient Approaches
Dual Modular Redundancy (DMR)
Flop Hardening
Ethos-U55 Soft Error Characterization
Fault Injection Setup
Translating SoC FIT Rate to NPU SDC per Inference
How Resilient is Ethos-U55 to Soft Errors?
Factors Shaping Functional Block Resilience
Sensitivity to MAC Sizes
Sensitivity to Applications
...and 21 more sections

Figures (9)

Figure 1: Ethos-U55 functional block diagram.
Figure 2: SDC rate of Ethos-U55 while running ResNet-18, CifarNet, and Wav2Letter at the TSMC 16 nm technology node.
Figure 3: Functional block SDC contribution for different configurations of Arm Ethos-U55.
Figure 4: Block-wise SDC contribution for different applications running on Arm Ethos-U55 with the MAC-32 configuration.
Figure 5: Variation in the reliability of Arm's Ethos-U55 between the TSMC 16 nm and 7 nm technology nodes, for the MAC-32 configuration running ResNet-18.
...and 4 more figures
