Resilience of Deep Learning applications: a systematic literature review of analysis and hardening techniques

Cristiana Bolchini; Luca Cassano; Antonio Miele

TL;DR

This systematic review surveys 220 papers published from 2019 to March 2024 on the resilience of deep learning (DL) against hardware faults, clarifying that the focus is on fault tolerance rather than adversarial security. It introduces a comprehensive, multi‑axis classification framework (covering scope, abstraction level, hardware platform, fault/error models, ML framework, tooling and reproducibility, dependability attributes, and hardening techniques/strategies) to enable consistent comparisons across resilience analysis and hardening studies. The review identifies two major strands—resilience analysis and hardening strategies (including redundancy‑based and DL‑specific approaches)—and highlights a growing adoption of cross‑layer methods, fault‑injection tools, and DL‑aware protections. It also emphasizes the need for reproducible research, open benchmarks, and an integrated ecosystem of tools to fairly compare methods and support practical deployment of resilient DL systems. The findings underscore substantial progress and a rich set of open challenges, including standardizing metrics, designing application‑specific protections, and developing scalable, hardware‑aware solutions for DL accelerators.

Abstract

Machine Learning (ML) is one of the most effective Artificial Intelligence (AI) technologies and is currently exploited in numerous applications across diverse fields, such as vision, autonomous systems, and the like. This trend has motivated a significant number of contributions to the analysis and design of ML applications that must withstand faults affecting the underlying hardware. The authors systematically investigate the existing body of knowledge on the resilience of Deep Learning (among ML techniques) against hardware faults, through a review that presents the strengths and weaknesses of this literature stream and then sets out future avenues of research. The review is based on 220 scientific articles published between January 2019 and March 2024. The authors adopt a classification framework to interpret and highlight research similarities and peculiarities, based on several parameters, ranging from the main scope of the work and the adopted fault and error models to the reproducibility of the results. This framework allows for a comparison of the different solutions and the identification of possible synergies. Furthermore, suggestions concerning the future direction of research are proposed in the form of open challenges to be addressed.

Paper Structure (30 sections, 7 figures, 8 tables)

This paper contains 30 sections, 7 figures, 8 tables.

Introduction
Methodology
Research design
Research method
Classification framework
Scope.
Abstraction level.
Hardware platform.
Fault model.
Error model.
ML Framework.
Tool support.
Reproducibility.
Dependability attribute.
Injection method.
...and 15 more sections

Figures (7)

Figure 1: Number of contributions on the domain of interest per year, in the considered time frame.
Figure 2: Flow diagram presenting the retrieval and screening process of the literature following the Preferred Reporting Items for Systematic Reviews and Meta‐Analyses (PRISMA) process.
Figure 3: The primary axes of the adopted classification framework, with a few sample values.
Figure 4: Paper organization.
Figure 5: Co-authorship analysis with "authors" as the unit of analysis. In this analysis, the minimum number of documents for each author is 3, and the number of selected authors is 93, grouped in 14 clusters, accordingly. Node size depends on the number of documents and the connecting lines between them indicate the collaboration between authors. The color spectrum represents the average number of citations.
...and 2 more figures
