"Faithful to What?" On the Limits of Fidelity-Based Explanations

Jackson Eshbaugh

TL;DR

The paper addresses a key limitation of fidelity-based explanations in explainable AI: fidelity measures alignment to a learned network rather than to the underlying data-generating signal. It introduces the linearity score $\lambda(f)$, defined as the $R^2$ between a network's function $f$ and its best linear surrogate $g$, to quantify how linearly decodable the network's input-output behavior is. Across synthetic and real regression datasets, the paper shows that surrogates can achieve high fidelity to the network yet miss the predictive gains that distinguish the network from simpler models, and in some cases even underperform linear baselines trained directly on the data. The findings suggest that fidelity-based explanations reveal model behavior but not necessarily task-relevant structure, with the $\lambda(f)$ diagnostic helping to temper overreliance on fidelity, especially under distribution shift.

Abstract

In explainable AI, surrogate models are commonly evaluated by their fidelity to a neural network's predictions. Fidelity, however, measures alignment to a learned model rather than alignment to the data-generating signal underlying the task. This work introduces the linearity score $\lambda(f)$, a diagnostic that quantifies the extent to which a regression network's input-output behavior is linearly decodable. $\lambda(f)$ is defined as an $R^2$ measure of surrogate fit to the network. Across synthetic and real-world regression datasets, we find that surrogates can achieve high fidelity to a neural network while failing to recover the predictive gains that distinguish the network from simpler models. In several cases, high-fidelity surrogates underperform even linear baselines trained directly on the data. These results demonstrate that explaining a model's behavior is not equivalent to explaining the task-relevant structure of the data, highlighting a limitation of fidelity-based explanations when used to reason about predictive performance.
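
To make the definition concrete, here is a minimal sketch of how a linearity score could be computed, assuming the "best linear surrogate" is an ordinary least-squares affine fit to the network's outputs on a sample of inputs and that $R^2$ takes its usual form, $1 - \mathrm{SS}_{\mathrm{res}} / \mathrm{SS}_{\mathrm{tot}}$. The helper names (`fit_linear`, `r2`, `linearity_score`) and the two toy functions standing in for a network are illustrative assumptions, not the paper's code.

```python
import numpy as np

def fit_linear(X, t):
    """Least-squares affine fit t ~ X @ w + b. Returns (w, b)."""
    A = np.column_stack([X, np.ones(len(X))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return coef[:-1], coef[-1]

def r2(target, pred):
    """Coefficient of determination of `pred` with respect to `target`.
    Undefined when `target` is constant (zero total variance)."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - np.mean(target)) ** 2)
    return 1.0 - ss_res / ss_tot

def linearity_score(f, X):
    """Assumed form of lambda(f): R^2 between f's outputs on X and the
    outputs of an affine surrogate g fitted to f (not to the labels)."""
    f_out = np.asarray(f(X), dtype=float)
    w, b = fit_linear(X, f_out)
    g_out = X @ w + b
    return r2(f_out, g_out)

# Toy inputs; the lambdas below are hypothetical stand-ins for a trained network.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# An exactly affine function is fully linearly decodable.
score_linear = linearity_score(lambda X: X @ np.array([1.0, -2.0, 0.5]) + 3.0, X)

# A purely quadratic function of a symmetric input is poorly captured by any affine fit.
score_quadratic = linearity_score(lambda X: X[:, 0] ** 2, X)
```

Note that the score only compares the surrogate with the network. The true targets never enter the computation, which is exactly why a high value says nothing by itself about predictive performance.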

Paper Structure

This paper contains 8 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Results from the California dataset experiments.
  • Figure 2: Predictions from the baseline linear model, neural network, and linear surrogate on the synthetic dataset.
  • Figure 3: Predictions from the baseline linear model, neural network, and linear surrogate on the Medical Insurance Cost dataset. Despite closely matching the network $(\lambda(f) = 0.9186)$, the surrogate underperforms on the true target.
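
The three-way comparison shown in Figures 2 and 3 can be sketched on toy data as follows. This is an illustration under stated assumptions, not a reproduction of the paper's experiments: the data-generating function is invented, `network_pred` is a fixed nonlinear function standing in for a trained network, and all names are hypothetical. One fact the sketch relies on is certain: on the data used to fit it, a least-squares linear baseline minimizes squared error to the true target among all affine models, so an affine surrogate fitted to the network cannot beat it there, however high its fidelity.

```python
import numpy as np

def affine_fit_predict(X, t):
    """Fit t ~ X @ w + b by least squares and return the fitted values on X."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return A @ coef

def r2(target, pred):
    """Coefficient of determination of `pred` with respect to `target`."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - np.mean(target)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# Invented target: a linear term plus an interaction term and noise.
y = X[:, 0] + 0.5 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=1000)

# Hypothetical stand-in for a trained network's predictions.
network_pred = X[:, 0] + 0.5 * np.tanh(X[:, 0] * X[:, 1])

baseline_pred = affine_fit_predict(X, y)              # linear model fit to the data
surrogate_pred = affine_fit_predict(X, network_pred)  # linear model fit to the network

fidelity = r2(network_pred, surrogate_pred)  # surrogate vs. network, i.e. lambda(f)
r2_baseline = r2(y, baseline_pred)           # each model vs. the true target
r2_network = r2(y, network_pred)
r2_surrogate = r2(y, surrogate_pred)

print(f"fidelity={fidelity:.4f}  baseline={r2_baseline:.4f}  "
      f"network={r2_network:.4f}  surrogate={r2_surrogate:.4f}")
```

In this construction the surrogate should track the stand-in network closely while scoring no better than the baseline on `y`. Whatever advantage the network has over the baseline comes from its nonlinear term, which an affine surrogate discards by design.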