Guarantee Regions for Local Explanations

Marton Havasi; Sonali Parbhoo; Finale Doshi-Velez

TL;DR

The paper addresses the problem that local surrogate explanations may not extrapolate faithfully beyond their target point. It introduces anchor boxes with explicit statistical guarantees, defined by a fidelity condition $|f(x)-g(x)|<\epsilon$ together with a purity level $\rho$ and a confidence level $1-\delta$, and formalizes the faithful region as an axis-aligned box with volume $|\mathbb{A}|=\prod_{d=1}^D (u_d-l_d)$. A key result proves that exactly finding the maximal anchor box is exponentially hard in the dimension, motivating a scalable divide-and-conquer algorithm, FindAnchor, that builds large boxes by progressively merging one-dimensional anchors using a maximum-box subroutine (FindMB) and a sequence of statistical tests that ensure purity. Empirically, across multiple datasets and surrogate settings, the method yields larger guarantee regions than the baselines and can identify when a local explanation is dishonest, while keeping computational requirements reasonable. The approach offers a principled, model-agnostic way to quantify and extend the trustworthy region of local explanations, which may improve interpretability and trust in safety-critical applications, especially where extrapolation beyond the local neighborhood matters.
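The quantities above can be made concrete with a small sketch. Here an axis-aligned box is stored as lower and upper bounds, its volume is $\prod_{d=1}^D (u_d-l_d)$, and a box is accepted when a one-sided Hoeffding lower confidence bound on its empirical purity (the fraction of sampled points with $|f(x)-g(x)|<\epsilon$) is at least $\rho$ with confidence $1-\delta$. The Hoeffding test, uniform sampling inside the box, the toy models, and the names `AnchorBox`, `empirical_purity`, and `passes_purity_test` are assumptions for illustration only; the paper's own sequence of statistical tests may differ.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical type alias: a model maps an input vector to a real-valued prediction.
Model = Callable[[Sequence[float]], float]


@dataclass
class AnchorBox:
    """Axis-aligned box [l_1, u_1] x ... x [l_D, u_D] (hypothetical helper)."""
    lower: List[float]
    upper: List[float]

    def volume(self) -> float:
        # |A| = prod_{d=1}^{D} (u_d - l_d)
        return math.prod(u - l for l, u in zip(self.lower, self.upper))

    def contains(self, x: Sequence[float]) -> bool:
        return all(l <= xi <= u for l, xi, u in zip(self.lower, x, self.upper))

    def sample(self, rng: random.Random) -> List[float]:
        # Assumption: points are drawn uniformly inside the box.
        return [rng.uniform(l, u) for l, u in zip(self.lower, self.upper)]


def empirical_purity(box: AnchorBox, f: Model, g: Model, eps: float,
                     n: int, rng: random.Random) -> float:
    """Fraction of n sampled points where |f(x) - g(x)| < eps."""
    faithful = sum(abs(f(x) - g(x)) < eps for x in (box.sample(rng) for _ in range(n)))
    return faithful / n


def passes_purity_test(box: AnchorBox, f: Model, g: Model, eps: float,
                       rho: float, delta: float, n: int, seed: int = 0) -> bool:
    """Accept the box if a one-sided Hoeffding lower bound on its purity is >= rho.

    With probability at least 1 - delta, true purity >= p_hat - sqrt(ln(1/delta) / (2n)).
    """
    p_hat = empirical_purity(box, f, g, eps, n, random.Random(seed))
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return p_hat - margin >= rho


# Toy 2-D example: f is mildly nonlinear and g is its linear part.
f = lambda x: x[0] + 0.5 * x[1] + 0.1 * x[0] ** 2
g = lambda x: x[0] + 0.5 * x[1]
small = AnchorBox([-0.5, -1.0], [0.5, 1.0])
large = AnchorBox([-3.0, -1.0], [3.0, 1.0])
print(small.volume(), large.volume())  # 2.0 12.0
print(passes_purity_test(small, f, g, eps=0.05, rho=0.99, delta=0.01, n=50_000))  # True
print(passes_purity_test(large, f, g, eps=0.05, rho=0.99, delta=0.01, n=50_000))  # False
```

In the small box $|f-g| = 0.1\,x_0^2 \le 0.025$ everywhere, so the test passes once $n$ is large enough for the Hoeffding margin to fall below $1-\rho$; the larger box has a much bigger volume but only about a quarter of its points are faithful, so it is rejected. This tension between volume and purity is what a search for large anchor boxes has to trade off.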

Abstract

Interpretability methods that utilise local surrogate models (e.g. LIME) are very good at describing the behaviour of the predictive model at a point of interest, but they are not guaranteed to extrapolate to the local region surrounding the point. In particular, overfitting to the local curvature of the predictive model and malicious tampering can significantly limit extrapolation. We propose an anchor-based algorithm for identifying regions in which local explanations are guaranteed to be correct by explicitly describing the intervals along which the input features can be trusted. Our method produces an interpretable feature-aligned box in which the prediction of the local surrogate model is guaranteed to match that of the predictive model. We demonstrate that our algorithm finds explanations with larger guarantee regions that better cover the data manifold than existing baselines. We also show how our method can identify misleading local explanations, which have significantly poorer guarantee regions.

Paper Structure

This paper contains 14 sections, 1 theorem, 8 equations, 4 figures, 7 tables, and 2 algorithms.

Introduction
Related Works
Anchor Boxes with Statistical Guarantees
The Computational Challenge: Identifying Maximal Anchor Boxes Grows Exponentially in Dimension
A Divide-and-Conquer Algorithm for Finding Large Anchor Boxes
Results and Discussion
Identifying when explanations are dishonest
Capturing a larger volume with anchor boxes
Capturing a local cluster using an anchor box
Hyperparameters
Conclusions
Theoretical results
Box expansion
Results on neural networks

Key Result

Theorem 3.2

If there exists a true function $e_0$ for the data and a set of possible faithfulness functions $\mathcal{E}=\{e_k: \mathbb{R}^D\rightarrow \{0, 1\}\}_{k=1}^K$, with $K=\binom{D}{\lfloor D/2 \rfloor}$, that agree on a collection of input points but have different bounding boxes, and …

Proof: See the appendix for details.
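Theorem 3.2 is stated in terms of a family of $K=\binom{D}{\lfloor D/2\rfloor}$ candidate faithfulness functions. To illustrate how quickly this count grows with the dimension $D$ (the central binomial coefficient grows like $2^D$ up to a factor of order $1/\sqrt{D}$), the snippet below simply evaluates the formula for a few dimensions; it does not reproduce the proof, and the function name is hypothetical.

```python
from math import comb


def num_candidate_faithfulness_functions(D: int) -> int:
    """K = C(D, floor(D/2)), the size of the function family in Theorem 3.2."""
    return comb(D, D // 2)


for D in (2, 5, 10, 20, 30):
    print(D, num_candidate_faithfulness_functions(D))
# 2 -> 2, 5 -> 10, 10 -> 252, 20 -> 184756, 30 -> 155117520
```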

Figures (4)

Figure 1: (Left) The guarantee region captures the area around the anchor point $a$ in which the prediction of the complex model $f$ differs from that of the surrogate model $g$ by less than $\epsilon$. (Right) Anchor box of the faithful region in $D=2$ dimensions (shown in red). The anchor box captures an axis-aligned box (defined by lower and upper bounds $l=(l_1, l_2)$ and $u=(u_1, u_2)$) where the surrogate model is guaranteed to be faithful to the complex model.
Figure 2: We compute the anchor boxes for local linear explanations in which feature $k$ is given 0 weight. In the honest model, feature $k$ is not used for the prediction; in the dishonest model, the prediction relies on feature $k$. We see a clear separation between the honest and dishonest models: the anchor box for the honest model is large along feature $k$, while for the dishonest model it is much smaller.
Figure 2: The portion of the local cluster of points captured in the guarantee region ($\rho=0.99$, $\delta=0.01$) and the number of function evaluations $b$ each method requires, on the Synthetic dataset. Means and standard deviations are shown over 20 test anchor points. Our method captures more of the local cluster than the baselines.
Figure 3: Visualization of the Synthetic dataset at $D=2$. The anchor point is denoted by a black dot and it belongs to the green cluster of points. Our anchor box captures a large portion of the local cluster.
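The honest-versus-dishonest experiment can be mimicked with a toy linear setup. Assume a surrogate $g$ that gives feature $k$ zero weight, an honest model identical to $g$, and a dishonest model that adds a hidden term $c\,(x_k-a_k)$. Holding the other coordinates at the anchor $a$, the sketch below finds by bisection how far the faithful interval extends along feature $k$, capped at a search radius. The linear models, the coefficient $c=2$, the radius, and the function `half_width_along_feature` are hypothetical; bisection assumes the surrogate is faithful at the anchor and that the faithful points on this line form a single interval, which holds for these linear examples but not in general.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def half_width_along_feature(f: Model, g: Model, anchor: Sequence[float], k: int,
                             eps: float, radius: float = 10.0, tol: float = 1e-6) -> float:
    """Largest t in [0, radius] with |f(x) - g(x)| < eps at x = anchor + t * e_k.

    Only the positive direction along feature k is searched. Assumes g is faithful at
    the anchor and that faithful points on this line form one interval around it.
    """
    def faithful(t: float) -> bool:
        x: List[float] = list(anchor)
        x[k] += t
        return abs(f(x) - g(x)) < eps

    if faithful(radius):
        return radius  # faithful up to the search limit
    lo, hi = 0.0, radius  # invariant: faithful(lo) holds, faithful(hi) does not
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if faithful(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical toy setup: the surrogate g gives feature k = 2 zero weight.
anchor = [1.0, 2.0, 0.5]
k = 2
g = lambda x: 0.7 * x[0] - 0.3 * x[1]
honest = lambda x: 0.7 * x[0] - 0.3 * x[1]             # ignores feature k
dishonest = lambda x: g(x) + 2.0 * (x[k] - anchor[k])  # secretly relies on feature k
print(half_width_along_feature(honest, g, anchor, k, eps=0.1))     # 10.0 (search limit)
print(half_width_along_feature(dishonest, g, anchor, k, eps=0.1))  # close to eps / 2.0 = 0.05
```

For the dishonest model the faithful half-width along feature $k$ shrinks to $\epsilon/|c|$, whereas for the honest model it is bounded only by the search radius, mirroring the separation described for Figure 2.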

Theorems & Definitions (3)

Definition 3.1
Theorem 3.2
Proof
