Unifying Image Counterfactuals and Feature Attributions with Latent-Space Adversarial Attacks

Jeremy Goldwasser, Giles Hooker

TL;DR

The paper tackles the difficulty of producing meaningful image counterfactuals by introducing Counterfactual Attacks, which perform gradient-based edits in a low-dimensional latent space to traverse the data manifold toward a target prediction. It unifies counterfactual generation with feature attributions by training lightweight attribute predictors in latent space and computing per-attribute changes that aggregate into global explanations, all without extensive hyperparameter tuning. The authors demonstrate the approach on MNIST and CelebA, showing realistic counterfactuals and interpretable content changes, while also providing a mechanism to quantify what features drive changes via global attributions. This framework offers a practical, scalable path for interpretable vision models and model debugging, combining counterfactuals with quantitative attribution summaries to enhance trust and diagnosis.
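The latent-space idea in the summary above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the linear "decoder" `W`, the logistic classifier `(w, b)`, the normalized step, and the stopping threshold `target` are all assumptions chosen to keep the example small and self-contained. In practice the decoder would be a trained generative model and the gradients would come from automatic differentiation.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical toy setup: 2-D latent codes decoded linearly into 3-D "images".
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])        # decoder: x = W @ z
w = np.array([1.0, 0.5, 0.5])     # classifier weights on the decoded image
b = -1.0                          # classifier bias

def predict_proba(x):
    """Probability of the target class for a decoded image x."""
    return sigmoid(w @ x + b)

def counterfactual_attack(z0, target=0.9, step=0.1, max_iter=500):
    """Gradient ascent on the latent code until the target-class
    probability of the decoded image reaches `target`."""
    z = z0.copy()
    for _ in range(max_iter):
        p = predict_proba(W @ z)
        if p >= target:
            break
        # Chain rule through classifier and decoder: dp/dz = p(1-p) * W^T w
        grad = p * (1.0 - p) * (W.T @ w)
        # Normalized step (an assumption) so progress does not stall
        # when the sigmoid saturates.
        z = z + step * grad / (np.linalg.norm(grad) + 1e-12)
    return z, W @ z

z0 = np.array([-1.0, -1.0])       # latent code of the original input
p0 = predict_proba(W @ z0)
z_cf, x_cf = counterfactual_attack(z0)
p_cf = predict_proba(x_cf)
```

Because every iterate is decoded from a latent code, the counterfactual `x_cf` stays in the range of the decoder. This is the sense in which an edit of this kind moves along the learned manifold rather than perturbing pixels directly.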

Abstract

Counterfactuals are a popular framework for interpreting machine learning predictions. These "what if" explanations are notoriously challenging to create for computer vision models: standard gradient-based methods are prone to producing adversarial examples, in which imperceptible modifications to image pixels provoke large changes in predictions. We introduce a new, easy-to-implement framework for counterfactual images that can flexibly adapt to contemporary advances in generative modeling. Our method, Counterfactual Attacks, resembles an adversarial attack on the representation of the image along a low-dimensional manifold. In addition, given an auxiliary dataset of image descriptors, we show how to accompany counterfactuals with feature attributions that quantify the changes between the original and counterfactual images. These importance scores can be aggregated into global counterfactual explanations that highlight the overall features driving model predictions. While this unification is possible for any counterfactual method, it is particularly computationally efficient for ours. We demonstrate the efficacy of our approach on the MNIST and CelebA datasets.
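One way to picture the attribution step is as a change in predicted descriptors between the original and counterfactual latent codes, averaged over examples for a global summary. The sketch below works under stated assumptions and does not reproduce the paper's scoring rule: the latent codes, descriptor labels, and counterfactual codes are synthetic, and the attribute predictor is a plain least-squares linear map with the hypothetical name `predict_attributes`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 4, 3               # examples, latent dims, descriptors

# Synthetic stand-in for an auxiliary dataset: latent codes Z with
# descriptor values Y (e.g. attribute annotations).
Z = rng.normal(size=(n, d))
Y = Z @ rng.normal(size=(d, k)) + 0.01 * rng.normal(size=(n, k))

# Fit one linear attribute predictor per descriptor, with an intercept.
Z1 = np.hstack([Z, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(Z1, Y, rcond=None)

def predict_attributes(z):
    """Predicted descriptor values for one or more latent codes."""
    z = np.atleast_2d(z)
    return np.hstack([z, np.ones((len(z), 1))]) @ coef

# Hypothetical original and counterfactual latent codes for 10 inputs.
z_orig = rng.normal(size=(10, d))
z_cf = z_orig + 0.5 * rng.normal(size=(10, d))

# Local scores: change in each predicted descriptor per counterfactual.
local_scores = predict_attributes(z_cf) - predict_attributes(z_orig)
# Global scores: average change per descriptor across all counterfactuals.
global_scores = local_scores.mean(axis=0)
```

In this sketch a positive score means the counterfactual adds the descriptor and a negative score means it removes it. Since the predictors act on latent codes directly, scoring a counterfactual needs no extra decoding or image-level inference, which is one plausible reading of the efficiency remark in the abstract.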

Paper Structure

This paper contains 19 sections, 11 equations, 10 figures, and 1 algorithm.

Figures (10)

  • Figure 1: Running Counterfactual Attacks on the MNIST dataset.
  • Figure 2: Counterfactuals for a smiling classifier.
  • Figure 3: Counterfactual images with accompanying importance scores. Each row presents an individual counterfactual on a separate CNN classifier. All scores are negative, indicating the removal of their features.
  • Figure 4: Global counterfactual explanations for three CelebA classifiers. The direction indicates whether the feature is added or removed.
  • Figure 5: Counterfactuals that alter misclassified inputs to correct model predictions.
  • ...and 5 more figures