
On Gradient-like Explanation under a Black-box Setting: When Black-box Explanations Become as Good as White-box

Yi Cai, Gerhard Wunder

TL;DR

This work tackles the problem of producing gradient-like explanations when model internals are inaccessible. It introduces GEEX, a gradient-estimation-based explanation that performs a path-integral attribution from a baseline to the explicand using only query access, and proves that it satisfies core attribution axioms including Completeness and Sensitivity. Empirical results on MNIST, Fashion-MNIST, and ImageNet show GEEX yields sharp, gradient-like attributions that outpace black-box baselines and converge toward white-box IG with more queries, indicating strong practical utility in restricted-access settings. The approach is parallelizable and adaptable, with future work targeting variance reduction and feature-space decomposition via Linearity to further improve efficiency and scalability.

Abstract

Attribution methods shed light on the explainability of data-driven approaches such as deep learning models by uncovering the most influential features in a to-be-explained decision. While determining feature attributions via gradients delivers promising results, the internal access required for acquiring gradients can be impractical under safety concerns, thus limiting the applicability of gradient-based approaches. In response to such limited flexibility, this paper presents GEEX (gradient-estimation-based explanation), an approach that produces gradient-like explanations through only query-level access. The proposed approach satisfies a set of fundamental properties for attribution methods, each of which is rigorously proved, ensuring the quality of its explanations. In addition to the theoretical analysis, with a focus on image data, the experimental results empirically demonstrate the superiority of the proposed method over state-of-the-art black-box methods and its competitive performance compared to methods with full access.

Paper Structure (18 sections, 2 theorems, 52 equations, 5 figures, 1 table)

This paper contains 18 sections, 2 theorems, 52 equations, 5 figures, and 1 table.

Key Result

Theorem 1

GEEX, a path method built upon estimated gradients, satisfies Sensitivity, Insensitivity, Implementation Invariance, and Linearity.
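
For reference, the standard Integrated Gradients (IG) path method, which the TL;DR describes GEEX as converging toward, and the Completeness property it names can be written as follows. The notation is an assumption made here for illustration ($\boldsymbol{x}'$ for the baseline, $\boldsymbol{\xi}$ borrowed from the caption of Figure 3) and may differ from the paper's own; GEEX replaces the exact partial derivative with an estimate obtained from queries.

$$\xi_i^{\mathrm{IG}}(\boldsymbol{x}) = (x_i - x'_i)\int_0^1 \frac{\partial f\big(\boldsymbol{x}' + \alpha(\boldsymbol{x} - \boldsymbol{x}')\big)}{\partial x_i}\, d\alpha, \qquad \sum_i \xi_i^{\mathrm{IG}}(\boldsymbol{x}) = f(\boldsymbol{x}) - f(\boldsymbol{x}')$$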

Figures (5)

  • Figure 1: A simple case shows that considering the estimated gradient as an explanation can lead to misleading results. Suffering from gradient saturation, the attribution of $x$ converges to 0 as its value increases, conflicting with the truth that the value of the sigmoid function $f(\cdot)$ relies solely on $x$.
  • Figure 2: Given a baseline $f(-3)\approx0$, the smoothed version of GEEX better approximates the actual contribution of the input feature with the same amount of observations. While the red solid line corresponds to explanations from the interpolation-based GEEX, the green line represents the results from the "smoothed" version, almost overlapping the actual contribution depicted by the blue line. The dashed line indicates the error of the derived explanation compared to the ground truth given by the total contribution $f(x)$.
  • Figure 3: Overview of GEEX. A query $\boldsymbol{z}$ is determined by the sampled noise $\boldsymbol{\epsilon}$ and the position $\alpha$ on the path. The final explanation $\boldsymbol{\xi}$ (on the right, overlaid with the original input) is derived through the observations $\{f(\boldsymbol{z})\}$ and the pre-computed log derivatives.
  • Figure 4: Sample explanations from the selected competitors
  • Figure 5: For InceptionV3, GEEX achieves an AOPC score converging to IG when $n^*$ increases.
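
The one-dimensional sigmoid case from Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's exact estimator: it assumes Gaussian query noise with a hand-picked `sigma`, an antithetic score-function (log-derivative) gradient estimate, and uniformly sampled path positions `alpha`. The function names `estimate_gradient`, `path_attribution`, and `point_gradient` are hypothetical. The model is touched only through calls to `f`, mimicking query-level access.

```python
import numpy as np


def sigmoid(x):
    """Toy black-box model; only its outputs are used below."""
    return 1.0 / (1.0 + np.exp(-x))


def estimate_gradient(f, z, sigma, rng):
    """Per-sample estimate of the gradient of the Gaussian-smoothed f at z.

    For eps ~ N(0, sigma^2), the log-derivative trick gives
    d/dz E[f(z + eps)] = E[f(z + eps) * eps / sigma^2].
    Pairing each eps with -eps (antithetic sampling, an assumption made
    here to reduce variance) yields the expression returned below.
    """
    eps = rng.normal(0.0, sigma, size=np.shape(z))
    return (f(z + eps) - f(z - eps)) * eps / (2.0 * sigma**2)


def path_attribution(f, x, baseline, n_samples=100_000, sigma=0.5, seed=0):
    """Integrate estimated gradients along the straight path baseline -> x."""
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(0.0, 1.0, size=n_samples)  # positions on the path
    z = baseline + alpha * (x - baseline)          # points to query around
    grads = estimate_gradient(f, z, sigma, rng)
    return (x - baseline) * grads.mean()


def point_gradient(f, x, n_samples=100_000, sigma=0.5, seed=0):
    """Estimated gradient at the explicand only (no path integration)."""
    rng = np.random.default_rng(seed)
    z = np.full(n_samples, float(x))
    return estimate_gradient(f, z, sigma, rng).mean()


x, baseline = 6.0, -3.0  # baseline chosen so that f(baseline) is close to 0
saturated = point_gradient(sigmoid, x)
attribution = path_attribution(sigmoid, x, baseline)
true_contribution = sigmoid(x) - sigmoid(baseline)

print(f"estimated gradient at x: {saturated:.4f}")
print(f"path attribution:        {attribution:.4f}")
print(f"f(x) - f(baseline):      {true_contribution:.4f}")
```

In this sketch the estimated gradient at the explicand is close to zero because the sigmoid saturates, whereas the path-integrated estimate stays close to `f(x) - f(baseline)`. The smoothing introduces a small bias that grows with `sigma`, and the sample count trades query budget against variance.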

Theorems & Definitions (9)

  • Theorem 1
  • Theorem 2
  • proof
  • proof
  • proof
  • proof
  • proof
  • Definition 3
  • proof