Trust Regions for Explanations via Black-Box Probabilistic Certification
Amit Dhurandhar, Swagatam Haldar, Dennis Wei, Karthikeyan Natesan Ramamurthy
TL;DR
This work tackles the problem of certifying explanations for black-box models by identifying the largest trust region around a given input where the explanation retains a target fidelity with high probability. It introduces Ecertify, a framework that uses three sampling strategies—unif, unifI, and adaptI—to probabilistically certify regions under a query budget, with rigorous finite-sample and asymptotic guarantees. The approach leverages region-wise fidelity analysis, decomposition across hypercube partitions, and both Lipschitz and piecewise-linear assumptions to derive bounds and enable faster certification in high dimensions. Experiments across synthetic and real datasets demonstrate substantial query savings and robust region-wide validity for popular explainers like LIME, SHAP, and RISE, offering a practical path to compare explanations and reuse certified regions in deployment.
Abstract
Given the black-box nature of machine learning models, a plethora of explainability methods have been developed to decipher the factors behind individual decisions. In this paper, we introduce a novel problem of black-box (probabilistic) explanation certification. We ask the question: Given a black-box model with only query access, an explanation for an example, and a quality metric (viz. fidelity, stability), can we find the largest hypercube (i.e., $\ell_{\infty}$ ball) centered at the example such that, when the explanation is applied to all examples within the hypercube, a quality criterion (viz. fidelity greater than some value) is met with high probability? Being able to efficiently find such a \emph{trust region} has multiple benefits: i) insight into model behavior in a \emph{region}, with a \emph{guarantee}; ii) ascertained \emph{stability} of the explanation; iii) \emph{explanation reuse}, which can save time, energy and money by not having to find explanations for every example; and iv) a possible \emph{meta-metric} to compare explanation methods. Our contributions include formalizing this problem, proposing solutions, providing computable theoretical guarantees for these solutions, and experimentally showing their efficacy on synthetic and real data.
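
To make the problem statement concrete, the following is a minimal Python sketch of a naive baseline. It is not the paper's Ecertify procedure, and it carries none of its guarantees. It uniformly samples points in a hypercube around the example, checks a fidelity threshold at each one, and bisects on the half-width. The function names `sample_passes` and `largest_trust_width`, the fidelity definition $1 - |f(x) - g(x)|$, the bisection search, and the toy model are all assumptions made for illustration. Passing a finite sample is only empirical evidence that the criterion holds in the region.

```python
import numpy as np

def sample_passes(f, g, x0, width, theta, n_queries, rng):
    """True if every uniformly sampled point in the hypercube
    [x0 - width, x0 + width] has fidelity >= theta.
    Fidelity is ASSUMED here to be 1 - |f(x) - g(x)|."""
    d = x0.shape[0]
    X = x0 + rng.uniform(-width, width, size=(n_queries, d))
    fid = 1.0 - np.abs(f(X) - g(X))  # one black-box query per row of X
    return bool(np.all(fid >= theta))

def largest_trust_width(f, g, x0, theta, lo, hi,
                        n_queries=1000, n_iters=20, seed=0):
    """Bisect on the half-width in [lo, hi]; return the largest half-width
    whose sample passed (0.0 if none did)."""
    rng = np.random.default_rng(seed)
    certified = 0.0
    for _ in range(n_iters):
        mid = 0.5 * (lo + hi)
        if sample_passes(f, g, x0, mid, theta, n_queries, rng):
            certified, lo = mid, mid   # no violation seen: try a larger cube
        else:
            hi = mid                   # violation found: shrink the cube
    return certified

# Toy usage (hypothetical model and explanation, not from the paper):
# f(x) = ||x||^2 and a constant explanation g(x) = 0 at x0 = 0, so the
# fidelity is 1 - x^2 and the true half-width for theta = 0.75 is 0.5.
f = lambda X: np.sum(X ** 2, axis=1)
g = lambda X: np.zeros(X.shape[0])
w = largest_trust_width(f, g, np.zeros(1), theta=0.75, lo=0.0, hi=2.0)
```

Because sampling can miss a thin band of violations near the boundary, this baseline tends to slightly overestimate the half-width. Probabilistic certification under a query budget is meant to control exactly this kind of gap.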
