On Gradient-like Explanation under a Black-box Setting: When Black-box Explanations Become as Good as White-box
Yi Cai, Gerhard Wunder
TL;DR
This work tackles the problem of producing gradient-like explanations when model internals are inaccessible. It introduces GEEX, a gradient-estimation-based explanation that performs a path-integral attribution from a baseline to the explicand using only query access, and proves that it satisfies core attribution axioms including Completeness and Sensitivity. Empirical results on MNIST, Fashion-MNIST, and ImageNet show GEEX yields sharp, gradient-like attributions that outpace black-box baselines and converge toward white-box IG with more queries, indicating strong practical utility in restricted-access settings. The approach is parallelizable and adaptable, with future work targeting variance reduction and feature-space decomposition via Linearity to further improve efficiency and scalability.
Abstract
Attribution methods shed light on the explainability of data-driven approaches such as deep learning models by uncovering the most influential features in a to-be-explained decision. While determining feature attributions via gradients delivers promising results, the internal access required to obtain gradients can be impractical under safety concerns, which limits the applicability of gradient-based approaches. In response to this limitation, this paper presents GEEX (gradient-estimation-based explanation), an approach that produces gradient-like explanations through query-level access only. The proposed approach satisfies a set of fundamental properties for attribution methods, which are rigorously proved, ensuring the quality of its explanations. In addition to the theoretical analysis, experiments on image data empirically demonstrate the superiority of the proposed method over state-of-the-art black-box methods and its competitive performance compared to methods with full access.
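To make the idea of a query-only, path-integral attribution concrete, the following is a minimal sketch and not the authors' implementation. The text above does not specify the gradient estimator, so the antithetic Gaussian-smoothing estimator, the midpoint integration rule, and all names (`estimate_gradient`, `path_integral_attribution`, `sigma`, `n_samples`, `steps`) are assumptions made for illustration. The model is treated as a function `f` that maps a flattened input to a scalar score and can only be queried.

```python
import numpy as np


def estimate_gradient(f, x, sigma=0.1, n_samples=256, rng=None):
    """Estimate grad f(x) from queries only (assumed estimator, not from the paper)."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_samples, x.size))
    # Query the model at mirrored perturbations x + sigma*eps and x - sigma*eps.
    f_plus = np.array([f(x + sigma * e) for e in eps])
    f_minus = np.array([f(x - sigma * e) for e in eps])
    # Weight each perturbation direction by its central finite difference.
    return ((f_plus - f_minus) / (2.0 * sigma)) @ eps / n_samples


def path_integral_attribution(f, x, baseline, steps=16, **kwargs):
    """Average estimated gradients along the straight path baseline -> x."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule on [0, 1]
    delta = x - baseline
    grads = [estimate_gradient(f, baseline + a * delta, **kwargs) for a in alphas]
    # Scale the averaged gradient by the input difference, as in path-integral attribution.
    return delta * np.mean(grads, axis=0)


# Toy usage: a linear scorer stands in for the black-box model (hypothetical).
w = np.array([1.0, -2.0, 0.5, 3.0])
f = lambda z: float(w @ z)
x, baseline = np.ones(4), np.zeros(4)
attr = path_integral_attribution(
    f, x, baseline, n_samples=2000, rng=np.random.default_rng(0)
)
# Completeness check: the attributions should sum approximately to f(x) - f(baseline).
print(attr.sum(), f(x) - f(baseline))
```

In this sketch, Completeness holds only approximately because the gradient is estimated from a finite number of queries. The gap should shrink as `n_samples` grows, which mirrors the claim that the explanations approach the white-box result with more queries. The per-step estimates are independent of one another, so the queries could be batched or run in parallel.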
