Trust Regions for Explanations via Black-Box Probabilistic Certification

Amit Dhurandhar; Swagatam Haldar; Dennis Wei; Karthikeyan Natesan Ramamurthy

TL;DR

This work tackles the problem of certifying explanations for black-box models by identifying the largest trust region around a given input in which the explanation retains a target fidelity with high probability. It introduces Ecertify, a framework that uses three sampling strategies (unif, unifI, and adaptI) to probabilistically certify regions under a query budget, with finite-sample and asymptotic guarantees. The approach relies on region-wise fidelity analysis, a decomposition across hypercube partitions, and both Lipschitz and piecewise-linear assumptions to derive bounds and to speed up certification in high dimensions. Experiments on synthetic and real datasets show substantial query savings and region-wide validity for popular explainers such as LIME, SHAP, and RISE, offering a practical way to compare explanations and to reuse certified regions in deployment.

Abstract

Given the black box nature of machine learning models, a plethora of explainability methods have been developed to decipher the factors behind individual decisions. In this paper, we introduce a novel problem of black box (probabilistic) explanation certification. We ask the question: Given a black box model with only query access, an explanation for an example and a quality metric (viz. fidelity, stability), can we find the largest hypercube (i.e., $\ell_{\infty}$ ball) centered at the example such that when the explanation is applied to all examples within the hypercube, (with high probability) a quality criterion is met (viz. fidelity greater than some value)? Being able to efficiently find such a \emph{trust region} has multiple benefits: i) insight into model behavior in a \emph{region}, with a \emph{guarantee}; ii) ascertained \emph{stability} of the explanation; iii) \emph{explanation reuse}, which can save time, energy and money by not having to find explanations for every example; and iv) a possible \emph{meta-metric} to compare explanation methods. Our contributions include formalizing this problem, proposing solutions, providing theoretical guarantees for these solutions that are computable, and experimentally showing their efficacy on synthetic and real data.
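The question posed in the abstract can be made concrete with a small sketch. The code below is an illustrative stand-in, not the paper's Ecertify procedure: it samples uniformly inside the hypercube of half-width `w`, takes the smallest observed fidelity, and bisects on `w`. The function names, the fidelity definition `1 - |model(x) - explanation(x)|`, and the use of bisection are all assumptions made for this example, and the result carries none of the paper's probabilistic guarantees.

```python
import numpy as np


def min_fidelity_unif(model, explanation, x0, w, n_queries, rng):
    """Smallest sampled fidelity in the hypercube of half-width w around x0.

    Fidelity is taken to be 1 - |model(x) - explanation(x)|, an illustrative
    choice and not necessarily the definition used in the paper.
    """
    d = x0.shape[0]
    # Uniform queries in the l_inf ball {x : ||x - x0||_inf <= w}.
    X = x0 + rng.uniform(-w, w, size=(n_queries, d))
    fidelity = 1.0 - np.abs(model(X) - explanation(X))
    return float(fidelity.min())


def largest_half_width(model, explanation, x0, theta,
                       w_max=1.0, n_queries=1000, n_iters=20, seed=0):
    """Bisect on w, keeping the largest half-width whose samples all pass.

    Hypothetical helper: it only reports that no violation was *sampled*,
    which is weaker than a certificate.
    """
    rng = np.random.default_rng(seed)
    w_lo, w_hi = 0.0, w_max
    for _ in range(n_iters):
        w_mid = 0.5 * (w_lo + w_hi)
        if min_fidelity_unif(model, explanation, x0, w_mid, n_queries, rng) >= theta:
            w_lo = w_mid  # no violation sampled: try a larger region
        else:
            w_hi = w_mid  # violation sampled: shrink the region
    return w_lo
```

With the defaults, this sketch spends `n_iters * n_queries` model queries in total. Because sampling can miss the worst-case point (typically near a corner of the hypercube), the returned half-width tends to be slightly optimistic, which is the gap that a probabilistic certificate is meant to quantify.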

Paper Structure (26 sections, 8 theorems, 31 equations, 12 figures, 8 tables, 2 algorithms)

Introduction
Problem Formulation
Related Work
Method
Analysis
Bound Estimation and Special Cases
Bound Estimation
Special Cases
Experiments
Discussion Section
Applicability to other explanation method types:
Proofs for Results in Section \ref{sec:ans}
Topics Related to Extreme Value Theory
i.i.d. unifI strategy
Proof of Corollary \ref{cor:EVT}
...and 11 more sections

Key Result

Lemma 1

The probability that $\hat{f}^*_w$ and $f^*_w$ differ by at most $\epsilon$ decomposes over regions as follows: where $w_0=0$.

Figures (12)

Figure 1: Illustration of our three certification strategies. (a) depicts one of the final steps of the unif strategy, while (b) and (c) depict two consecutive close-to-final steps of unifI and adaptI respectively. The setup is the same as in Section \ref{sec:exp} with $d=2,~Q=1000$. The boxes have width $w=0.5$, which is the optimal width. The star in the center denotes the example whose explanation we want to certify, while the orange lines are level sets for fidelity ($\theta = 0.75$). The methods' different behaviors are apparent: unif queries examples uniformly at random, while unifI uniformly samples prototypes (blue stars) and then queries examples around these prototypes (green blobs). From one step to the next, unifI doubles the number of prototypes and halves the number of examples queried around each prototype. In contrast, adaptI, in the innermost loop, halves the number of prototypes and adaptively queries more around prototypes close to low-fidelity examples (lower left and upper right corners).
Figure 2: Each row corresponds to a dataset (Row #: 1-ImageNet, 2-CIFAR10, 3-Arrhythmia, 4,5-HELOC). First two columns are LIME half-width and timing results, while the last two columns are the same for SHAP. Our methods are significantly faster than ZO$^+$, while still converging to similar $w$ in most cases. It seems unif, unifI and adaptI are best for low ($100$s or lower), intermediate ($\approx 1000$) and high dimensions ($10000$s) respectively. Trusting the converged upon half-widths, one can also compare XAI methods as discussed below.
Figure 3: Left: sample and query savings using adaptI on the HELOC dataset for LIME, where $Q=1000$. With an order of magnitude fewer samples, and with fewer than 20% of the queries needed by LIME, we can find explanations for the dataset. Right: means and standard deviations of the actual fidelities of the covered samples for each subset. When considering the entire (effective) dataset (rightmost point), our regions satisfy the $\theta$ constraint with high probability.
Figure 4: Visualization of a certified explanation (black) for a randomly chosen example in the HELOC dataset, along with explanations for examples lying within the trust region (nine green pentagons on the left) and for nine randomly chosen examples lying outside the trust region (orange pentagons on the right). The explanations depict the feature importances for the top-5 features based on the certified explanation. f1, f4, f15, f18 and f5 denote 'ExternalRiskEstimate', 'AverageMInFile', 'MSinceMostRecentInqexcl7days', 'NetFractionRevolvingBurden' and 'NumSatisfactoryTrades' respectively. Explanations for examples within our trust region are significantly more similar to the certified explanation than those outside of it.
Figure 5: Certified half-width vs. dimension plots for the synthetic data setup, with different choices for setting $B$ in the three proposed strategies. Choosing min is (slightly) conservative, as it is almost always below the true certified width (the solid black curve), whereas both max and mean overestimate the true width. The y-axis is in log scale.
...and 7 more figures
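The sampling behaviors described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Gaussian perturbation with scale `sigma` around each prototype and the clipping back into the box are choices made for this example, and the adaptive reallocation used by adaptI is not sketched.

```python
import numpy as np


def unifi_schedule(n_protos, n_per_proto, n_steps):
    """(prototypes, queries per prototype) for each step.

    Follows the caption: from one step to the next, the number of
    prototypes doubles and the queries around each prototype halve.
    The starting values are assumptions supplied by the caller.
    """
    schedule = []
    for _ in range(n_steps):
        schedule.append((n_protos, n_per_proto))
        n_protos, n_per_proto = 2 * n_protos, max(1, n_per_proto // 2)
    return schedule


def sample_unif(x0, w, n, rng):
    """unif-style sampling: n points uniformly in the box of half-width w."""
    return x0 + rng.uniform(-w, w, size=(n, x0.shape[0]))


def sample_unifi(x0, w, n_protos, n_per_proto, sigma, rng):
    """unifI-style sampling: uniform prototypes, then points around each.

    Gaussian noise and clipping to the box are assumptions of this sketch.
    """
    d = x0.shape[0]
    protos = sample_unif(x0, w, n_protos, rng)
    noise = rng.normal(0.0, sigma, size=(n_protos, n_per_proto, d))
    points = (protos[:, None, :] + noise).reshape(-1, d)
    return protos, np.clip(points, x0 - w, x0 + w)
```

As long as the per-prototype count stays even, each step in this schedule uses the same number of queries, so the total budget is split evenly across steps while the sampling shifts from dense local probing to broad coverage.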

Theorems & Definitions (14)

Lemma 1
Lemma 2
Lemma 3
Lemma 4
Theorem 1
Proposition 1
Corollary 1
proof : Lemma 1 proof
proof : Lemma 2 proof
proof : Lemma 3 proof
...and 4 more
