
PVF (Parameter Vulnerability Factor): A Scalable Metric for Understanding AI Vulnerability Against SDCs in Model Parameters

Xun Jiao, Fred Lin, Harish D. Dixit, Joel Coburn, Abhinav Pandey, Han Wang, Venkat Ramesh, Jianyu Huang, Wang Xu, Daniel Moore, Sriram Sankar

TL;DR

This work introduces the Parameter Vulnerability Factor ($PVF$), a parameter-level metric that quantifies how vulnerable an AI model is to silent data corruptions (SDCs) in its parameters. Defined as $PVF = D/N$, it is estimated through large-scale fault-injection experiments under fault models such as Single-Bit Flip ($SBF$), Multiple Bit Flip ($MBF$), and Burst Bit Flip ($MBBF$), and it is validated through case studies on DLRM, LeNet, and Tiny BERT. The results show that vulnerability differs across parameter components and across model architectures, which informs targeted fault protection and hardware–software co-design and supports standardized resilience evaluation. The framework can be extended to training-time faults and to other AI architectures, making it a practical, scalable tool for designing reliable AI systems and guiding hardware mapping decisions.
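
The summary above does not define $D$ and $N$. One plausible reading, stated here as an assumption rather than as the authors' notation, is that $N$ is the number of faults injected into a parameter component and $D$ is the number of those injections that produce an incorrect output. Under that reading, the ratio is an empirical estimate of the probability described in the abstract:

$$
\widehat{PVF} = \frac{D}{N} \approx \Pr\left[\text{incorrect output} \mid \text{parameter corrupted}\right]
$$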

Abstract

Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults, e.g., silent data corruptions (SDCs), that can corrupt model parameters. When this occurs during AI inference/servicing, it can lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services. In light of this escalating threat, it is crucial to address two key questions: How vulnerable are AI models to parameter corruptions, and how do different components of a model (such as modules and layers) differ in their vulnerability to parameter corruptions? To address these questions systematically, we propose a novel quantitative metric, the Parameter Vulnerability Factor (PVF), inspired by the architectural vulnerability factor (AVF) from the computer architecture community, with the aim of standardizing the quantification of AI model vulnerability to parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular model parameter will result in an incorrect output. In this paper, we present several use cases that apply PVF to three types of tasks/models during inference: recommendation (DLRM), vision classification (CNN), and text classification (BERT), along with an in-depth vulnerability analysis of DLRM. PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency, for example by mapping vulnerable AI parameter components to well-protected hardware modules. The PVF metric is applicable to any AI model and has the potential to help unify and standardize AI vulnerability/resilience evaluation practice.
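
To make the definition concrete, the following is a minimal sketch of how a PVF-style estimate could be obtained by single-bit-flip fault injection. It is not the authors' framework: the toy linear classifier, the helper names `flip_bit_float32`, `predict`, and `estimate_pvf`, the float32 parameter format, and the choice to count an output as "incorrect" when it differs from the fault-free prediction are all assumptions made for illustration.

```python
import random
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Flip one bit of a float32 (0 = lowest mantissa bit, 31 = sign bit)."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def predict(weights, x):
    """Hypothetical toy model: class 1 if the dot product is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0


def estimate_pvf(weights, inputs, num_injections, rng):
    """Estimate PVF as (injections that change the output) / (all injections).

    Assumption: "incorrect" means "differs from the fault-free prediction".
    """
    golden = [predict(weights, x) for x in inputs]  # fault-free reference
    incorrect = 0
    for _ in range(num_injections):
        idx = rng.randrange(len(weights))   # which parameter to corrupt
        bit = rng.randrange(32)             # which bit to flip
        faulty = list(weights)              # corrupt a copy, not the original
        faulty[idx] = flip_bit_float32(weights[idx], bit)
        sample = rng.randrange(len(inputs))
        if predict(faulty, inputs[sample]) != golden[sample]:
            incorrect += 1
    return incorrect / num_injections


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    weights = [0.5, -0.25, 1.0]
    inputs = [[1.0, 2.0, 0.5], [-1.0, 0.5, -2.0]]
    print("estimated PVF:", estimate_pvf(weights, inputs, 1000, rng))
```

Each injection corrupts a fresh copy of the weights, so faults do not accumulate and every trial is an independent sample. Restricting `idx` to the parameters of one layer or module would give a per-component estimate, which is the granularity the abstract describes.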

Paper Structure

This paper contains 15 sections, 1 equation, 9 figures, 1 table.

Figures (9)

  • Figure 1: Fault injection experiments flow
  • Figure 2: DLRM architecture overview
  • Figure 3: PVF of DLRM Parameters under MBF
  • Figure 4: PVF of DLRM Parameters under MBBF
  • Figure 5: PVF of DLRM Parameters under SBF
  • ...and 4 more figures