Explanations are a Means to an End: Decision Theoretic Explanation Evaluation

Ziyang Guo, Berk Ustun, Jessica Hullman

TL;DR

The paper introduces a decision-theoretic framework that treats explanations as information signals whose value is measured by the expected improvement they enable on concrete decision tasks, formalizing three estimands: the Theoretic Value of Explanation, Human-Complementary Value, and Behavioral Value. It defines decision problems, signals, and rational benchmarks, derives theoretical upper bounds and a decomposition of explanation value, and provides practical estimators that rely on coarsened information models to avoid overfitting and on a robust problem specification. Through demonstrations in human–AI decision support and mechanistic interpretability, the work shows that explanations can have substantial theoretical value and meaningful human-complementary effects, though behavioral gains are task-dependent and may be limited in some settings. The framework offers a principled pathway to validate and compare explanation methods using task-aligned metrics, guiding deployment and further research toward actionable, performance-enhancing explanations.

Abstract

Explanations of model behavior are commonly evaluated via proxy properties that are only weakly tied to the purposes explanations serve in practice. We contribute a decision-theoretic framework that treats explanations as information signals valued by the expected improvement they enable on a specified decision task. This approach yields three distinct estimands: 1) a theoretical benchmark that upper-bounds the performance achievable by any agent with the explanation, 2) a human-complementary value that quantifies the theoretically attainable value that is not already captured by a baseline human decision policy, and 3) a behavioral value representing the causal effect of providing the explanation to human decision-makers. We instantiate these definitions in a practical validation workflow, and apply them to assess explanation potential and interpret behavioral effects in human-AI decision support and mechanistic interpretability.
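To make the first estimand concrete, the following is a minimal sketch of how a rational benchmark with and without an explanation might be computed on a toy discrete problem. It assumes, for illustration only, that the theoretic value is the gain in the best achievable expected payoff when the explanation `z` is added to a baseline signal `v`. The joint distribution, the accuracy payoff, and the names `rational_payoff` and `as_signal` are hypothetical and are not taken from the paper.

```python
def rational_payoff(joint, payoff, actions):
    """Best achievable expected payoff for an agent who observes the signal.

    joint  : dict mapping (signal, state) -> probability
    payoff : function (action, state) -> float
    """
    signals = {s for s, _ in joint}
    total = 0.0
    for s in signals:
        # For each signal value, pick the action with the highest
        # probability-weighted payoff, then add it to the total.
        total += max(
            sum(p * payoff(a, y) for (s2, y), p in joint.items() if s2 == s)
            for a in actions
        )
    return total


def as_signal(p_vzy, keep_z):
    """Collapse P(v, z, y) into P(signal, y), with or without the explanation z."""
    joint = {}
    for (v, z, y), p in p_vzy.items():
        s = (v, z) if keep_z else (v,)
        joint[(s, y)] = joint.get((s, y), 0.0) + p
    return joint


# Hypothetical joint distribution over baseline signal v, explanation z, state y.
p_vzy = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05,
    (0, 1, 0): 0.05, (0, 1, 1): 0.10,
    (1, 0, 0): 0.10, (1, 0, 1): 0.05,
    (1, 1, 0): 0.05, (1, 1, 1): 0.40,
}

def accuracy(a, y):
    # Assumed payoff: 1 for a correct binary decision, 0 otherwise.
    return 1.0 if a == y else 0.0

actions = [0, 1]
without_z = rational_payoff(as_signal(p_vzy, keep_z=False), accuracy, actions)
with_z = rational_payoff(as_signal(p_vzy, keep_z=True), accuracy, actions)
theoretic_value = with_z - without_z
print(round(without_z, 3), round(with_z, 3), round(theoretic_value, 3))
```

In this sketch the gain can never be negative, because an agent who sees both `v` and `z` can always ignore `z`. The human-complementary and behavioral values would additionally require observed human decisions, which this toy example does not model.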

Paper Structure

This paper contains 47 sections, 4 theorems, 14 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

Given a set of features $X$, a model prediction $\hat{Y}$, and an explanation $Z$ generated by a function that takes the features and the model prediction as input (see the preliminaries section), gaining access to the explanation does not improve the expected performance of the idealized agent, i.e.,

Figures (3)

  • Figure 1: Quantities defined in our framework. The researcher can confirm that an explanation has potential by proceeding from the theoretic to the human-complementary to the behavioral value of explanation, comparing the estimates at lower levels to those above.
  • Figure 2: Re-analysis of human-AI decision support in two prior studies (lai2019human; bansal2021does). Top: theoretic benchmarks (sec:foundational_principles) against behavioral values (sec:behavioral). Bottom: potential complementary value (sec:human-complementary). Error bars give bootstrapped 95% CIs (N=1000).
  • Figure 3: Alignment-audit results. Theoretic value increases up to 3 SAE features, then plateaus. Behavioral value increases with the number of features, but leaves substantial room for improvement relative to the benchmarks.

Theorems & Definitions (22)

  • Example: Medical Decision Making
  • Definition 1
  • Proposition 1
  • Corollary 1
  • Definition 2
  • Definition 3
  • Example: Medical Treatment
  • Definition 4
  • Definition 5
  • Example: Medical Treatment
  • ...and 12 more