The Manifold Hypothesis for Gradient-Based Explanations

Sebastian Bordt; Uddeshya Upadhyay; Zeynep Akata; Ulrike von Luxburg

TL;DR

This work investigates when gradient-based explanations for image classifiers are perceptually meaningful by proposing a manifold hypothesis: feature attributions are more perceptually aligned when they lie in the tangent space $\mathcal{T}_x$ of the data manifold. The authors estimate image manifolds via variational autoencoders (and reconstructive autoencoders) and quantify alignment by projecting attributions onto $\mathcal{T}_x$, using the metric $\|\text{proj}_{\mathcal{T}_x} E\|_2 / \|E\|_2$ and comparing to the random baseline $\sqrt{k/d}$. Across datasets (MNIST variants, EMNIST, CIFAR10, X-ray Pneumonia, Diabetic Retinopathy), tangent-space components correlate with perceptual clarity, and post-hoc methods (Integrated Gradients, SmoothGrad, Input $\times$ Gradient) along with $l_2$ adversarial training further improve alignment. The study also shows that tangent-space alignment is necessary but not sufficient for explanations and emphasizes that explanations must respect both the model and the data, with code available for replication.
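The alignment metric above can be illustrated with a minimal NumPy sketch. It assumes the tangent space is supplied as a matrix whose columns span $\mathcal{T}_x$ and have full column rank. The function names `fraction_in_tangent_space` and `random_baseline` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def fraction_in_tangent_space(attribution, tangent_basis):
    """Compute ||proj_T E||_2 / ||E||_2 for an attribution E.

    attribution:   array with d entries (any shape, it is flattened).
    tangent_basis: array of shape (d, k) whose columns span the tangent
                   space; assumed to have full column rank (an assumption
                   of this sketch).
    """
    e = np.asarray(attribution, dtype=float).ravel()
    # Orthonormalize the basis so that the projection is Q Q^T e.
    q, _ = np.linalg.qr(np.asarray(tangent_basis, dtype=float))
    projected = q @ (q.T @ e)
    return np.linalg.norm(projected) / np.linalg.norm(e)


def random_baseline(k, d):
    """Reference value sqrt(k/d) for a k-dimensional tangent space in R^d."""
    return np.sqrt(k / d)
```

For example, with $d = 4$ and a tangent space spanned by the first two coordinate axes, the attribution $(3, 0, 4, 0)$ has fraction $3/5$ in tangent space, because only its first component survives the projection.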

Abstract

When do gradient-based explanation algorithms provide perceptually-aligned explanations? We propose a criterion: the feature attributions need to be aligned with the tangent space of the data manifold. To provide evidence for this hypothesis, we introduce a framework based on variational autoencoders that allows us to estimate and generate image manifolds. Through experiments across a range of different datasets -- MNIST, EMNIST, CIFAR10, X-ray pneumonia and Diabetic Retinopathy detection -- we demonstrate that the more a feature attribution is aligned with the tangent space of the data, the more perceptually-aligned it tends to be. We then show that the attributions provided by popular post-hoc methods such as Integrated Gradients and SmoothGrad are more strongly aligned with the data manifold than the raw gradient. Adversarial training also improves the alignment of model gradients with the data manifold. As a consequence, we suggest that explanation algorithms should actively strive to align their explanations with the data manifold. This is an extended version of a CVPR Workshop paper. Code is available at https://github.com/tml-tuebingen/explanations-manifold.
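One way to picture how a generative decoder yields a tangent space is sketched below. It assumes, as an illustration rather than a statement of the authors' implementation, that the tangent space at a generated point is the column space of the decoder's Jacobian with respect to the latent code. The `toy_decoder` is a hypothetical stand-in for a trained VAE decoder, and the Jacobian is estimated with central finite differences instead of automatic differentiation.

```python
import numpy as np


def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: maps R^2 to R^3.

    The image of this map is the paraboloid x3 = x1^2 + x2^2.
    """
    z = np.asarray(z, dtype=float)
    return np.array([z[0], z[1], z[0] ** 2 + z[1] ** 2])


def tangent_basis(decoder, z, eps=1e-5):
    """Orthonormal basis for the column space of the decoder Jacobian at z.

    The Jacobian is estimated column by column with central finite
    differences; a real model would more likely use autodiff.
    """
    z = np.asarray(z, dtype=float)
    columns = []
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        columns.append((decoder(z + step) - decoder(z - step)) / (2 * eps))
    jacobian = np.stack(columns, axis=1)  # shape (d, k)
    q, _ = np.linalg.qr(jacobian)  # orthonormal columns, shape (d, k)
    return q
```

At the latent point $z = (1, 0)$ the toy manifold has normal direction $(-2, 0, 1)$, so every column of the returned basis should be orthogonal to that vector. A basis obtained this way can then be used to project an attribution onto the tangent space.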

Paper Structure (38 sections, 1 theorem, 31 equations, 24 figures, 1 table, 1 algorithm)

Introduction
Related Work
Overview of our approach: Measuring alignment with the image manifold
Background
Data manifolds and tangent spaces.
Model gradients and explanation algorithms.
How do we know the image manifold?
Measuring alignment with the image manifold
Experimental Results
Experimental Setup
Datasets.
The part of an attribution that lies in tangent space is perceptually-aligned
Post-hoc methods align attributions with the data manifold
Attributions more aligned with the data manifold are more perceptually-aligned
The tangent space gives rise to a notion of feature importance
...and 23 more sections

Key Result

Theorem 1

For every dimension $d>1$, there exists a manifold $\mathcal{M}_d\subset\mathbb{R}^d$, a probability distribution $\mathcal{D}$ on $\mathcal{M}_d\times\{-1,1\}$ and a maximum-margin classifier with zero test error given by such that

Figures (24)

Figure 1: Conceptual overview of our approach. We first estimate the data manifold of an existing dataset with a variational autoencoder, then use the decoder as a generative model. On the generated data, we train a classifier $f$. For this classifier, we evaluate whether different gradient-based explanations $\mathcal{E}_i$ align with the tangent space of the data manifold. Moving along an explanation aligned with the tangent space keeps us on the manifold, whereas moving along an orthogonal explanation takes us off the manifold. Our hypothesis is that the latter does not lead to perceptually-aligned explanations because it describes changes that lead to unnatural images.
Figure 2: The part of an attribution that lies in the tangent space is perceptually-aligned, whereas the part that is orthogonal to the tangent space is not. (First row) Images from the test set of MNIST32. (Second row) The part of the attribution that lies in tangent space. (Third row) The part of the attribution that is orthogonal to the tangent space. Red corresponds to positive, blue to negative attribution (best viewed in digital format). Additional attributions for more images are depicted in appendix Figure \ref{fig:apx_mnist_32_additional_attributions}.
Figure 3: Post-hoc explanation methods improve the alignment of model gradients with the data manifold. The figure shows the fraction in tangent space for four different explanation methods on six different datasets. The gray line indicates the random baseline $\sqrt{k/d}$ (compare Sec. \ref{sec:measurement}).
Figure 4: Feature attributions that are more aligned with the data manifold are more explanatory. (Top row) CIFAR10, (Middle row) X-Ray Pneumonia and (Bottom row) Diabetic Retinopathy. The number below an attribution depicts the fraction of the attribution in tangent space.
Figure 5: The tangent space gives rise to a notion of feature importance. The figure shows the ROAR benchmark on MNIST32. Additional figures for other datasets are in appendix \ref{apx:figures}.
...and 19 more figures

Theorems & Definitions (2)

Theorem 1: Generalization does not imply alignment of gradients with the data manifold
proof
