The Manifold Hypothesis for Gradient-Based Explanations
Sebastian Bordt, Uddeshya Upadhyay, Zeynep Akata, Ulrike von Luxburg
TL;DR
This work investigates when gradient-based explanations for image classifiers are perceptually meaningful by proposing a manifold hypothesis: feature attributions are more perceptually aligned when they lie in the tangent space $\mathcal{T}_x$ of the data manifold at the input $x$. The authors estimate image manifolds via variational autoencoders (and reconstructive autoencoders) and quantify alignment by projecting an attribution $E$ onto $\mathcal{T}_x$, using the fraction $\|\text{proj}_{\mathcal{T}_x} E\|_2 / \|E\|_2$ and comparing it to the random baseline $\sqrt{k/d}$, where $k$ is the dimension of the tangent space and $d$ is the dimension of the input. Across datasets (MNIST variants, EMNIST, CIFAR10, X-ray Pneumonia, Diabetic Retinopathy), attributions with larger tangent-space components tend to be perceptually clearer, and post-hoc methods (Integrated Gradients, SmoothGrad, Input $\times$ Gradient) and $\ell_2$ adversarial training both improve alignment. The study also shows that tangent-space alignment is necessary but not sufficient for meaningful explanations and emphasizes that explanations must respect both the model and the data. Code is available for replication.
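The alignment metric above can be illustrated with a minimal NumPy sketch. It assumes the tangent space is given as a $d \times k$ matrix whose columns span $\mathcal{T}_x$; the function names `tangent_fraction` and `random_baseline` and the random test data are hypothetical and not taken from the authors' code. For a random direction, the expected squared fraction is exactly $k/d$, so the fraction itself is roughly $\sqrt{k/d}$.

```python
import numpy as np

def tangent_fraction(attribution, tangent_vectors):
    """Return ||proj_T E||_2 / ||E||_2.

    attribution:     array of shape (d,), the attribution E
    tangent_vectors: array of shape (d, k), columns span the tangent space
                     (they do not need to be orthonormal)
    """
    # Orthonormalize the spanning vectors so the projection is Q Q^T E.
    q, _ = np.linalg.qr(tangent_vectors)
    coords = q.T @ attribution  # coordinates of the projection in the basis Q
    return np.linalg.norm(coords) / np.linalg.norm(attribution)

def random_baseline(k, d):
    """Approximate fraction expected for a random direction in R^d."""
    return np.sqrt(k / d)

# Illustrative data only: a random 10-dimensional subspace of R^784.
rng = np.random.default_rng(0)
d, k = 784, 10
basis = rng.normal(size=(d, k))

in_tangent = basis @ rng.normal(size=k)  # lies inside the subspace
random_fractions = [tangent_fraction(rng.normal(size=d), basis) for _ in range(200)]

print("attribution inside tangent space:", tangent_fraction(in_tangent, basis))
print("mean fraction of random vectors: ", np.mean(random_fractions))
print("baseline sqrt(k/d):              ", random_baseline(k, d))
```

An attribution whose fraction is well above the baseline is more aligned with the tangent space than a random vector would be, which is the comparison the metric is meant to support.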
Abstract
When do gradient-based explanation algorithms provide perceptually-aligned explanations? We propose a criterion: the feature attributions need to be aligned with the tangent space of the data manifold. To provide evidence for this hypothesis, we introduce a framework based on variational autoencoders that allows us to estimate and generate image manifolds. Through experiments across a range of different datasets -- MNIST, EMNIST, CIFAR10, X-ray pneumonia and Diabetic Retinopathy detection -- we demonstrate that the more a feature attribution is aligned with the tangent space of the data, the more perceptually-aligned it tends to be. We then show that the attributions provided by popular post-hoc methods such as Integrated Gradients and SmoothGrad are more strongly aligned with the data manifold than the raw gradient. Adversarial training also improves the alignment of model gradients with the data manifold. As a consequence, we suggest that explanation algorithms should actively strive to align their explanations with the data manifold. This is an extended version of a CVPR Workshop paper. Code is available at https://github.com/tml-tuebingen/explanations-manifold.
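One way to picture how a generative model yields a tangent space is sketched below. It assumes, as an illustration rather than a description of the released code, that the tangent space at a generated point is spanned by the columns of the decoder's Jacobian with respect to the latent code. The decoder here is a hypothetical toy function, not a trained variational autoencoder, and the Jacobian is approximated by central finite differences.

```python
import numpy as np

def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: maps a 2-D latent to R^4."""
    z1, z2 = z
    return np.array([np.cos(z1), np.sin(z1), z2, z1 * z2])

def decoder_jacobian(decoder, z, eps=1e-5):
    """Central finite-difference Jacobian of the decoder at z, shape (d, k)."""
    z = np.asarray(z, dtype=float)
    columns = []
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        columns.append((decoder(z + step) - decoder(z - step)) / (2 * eps))
    return np.stack(columns, axis=1)

def tangent_basis(decoder, z):
    """Orthonormal basis (d, k) for the span of the Jacobian columns."""
    q, _ = np.linalg.qr(decoder_jacobian(decoder, z))
    return q

def split_attribution(attribution, basis):
    """Split an attribution into its tangent part and the orthogonal remainder."""
    tangent_part = basis @ (basis.T @ attribution)
    return tangent_part, attribution - tangent_part

z0 = np.array([0.0, 1.0])
basis_z0 = tangent_basis(toy_decoder, z0)

# At z0 the Jacobian columns are (0, 1, 0, 1) and (0, 0, 1, 0).
off_manifold = np.array([1.0, 0.0, 0.0, 0.0])  # orthogonal to both columns
on_manifold = np.array([0.0, 2.0, 3.0, 2.0])   # a combination of both columns

for name, e in [("off-manifold", off_manifold), ("on-manifold", on_manifold)]:
    tangent_part, orthogonal_part = split_attribution(e, basis_z0)
    print(name, np.linalg.norm(tangent_part) / np.linalg.norm(e))
```

With a real decoder, automatic differentiation would normally replace the finite-difference loop; the toy version is used here only so that the example runs without a trained model.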
