Unifying Image Counterfactuals and Feature Attributions with Latent-Space Adversarial Attacks

Jeremy Goldwasser; Giles Hooker

TL;DR

The paper tackles the difficulty of producing meaningful image counterfactuals by introducing Counterfactual Attacks, which perform gradient-based edits in a low-dimensional latent space to traverse the data manifold toward a target prediction. It unifies counterfactual generation with feature attributions by training lightweight attribute predictors in latent space and computing per-attribute changes that aggregate into global explanations, all without extensive hyperparameter tuning. The authors demonstrate the approach on MNIST and CelebA, showing realistic counterfactuals and interpretable content changes, while also providing a mechanism to quantify what features drive changes via global attributions. This framework offers a practical, scalable path for interpretable vision models and model debugging, combining counterfactuals with quantitative attribution summaries to enhance trust and diagnosis.

Abstract

Counterfactuals are a popular framework for interpreting machine learning predictions. These what-if explanations are notoriously challenging to create for computer vision models: standard gradient-based methods are prone to produce adversarial examples, in which imperceptible modifications to image pixels provoke large changes in predictions. We introduce a new, easy-to-implement framework for counterfactual images that can flexibly adapt to contemporary advances in generative modeling. Our method, Counterfactual Attacks, resembles an adversarial attack on the representation of the image along a low-dimensional manifold. In addition, given an auxiliary dataset of image descriptors, we show how to accompany counterfactuals with feature attributions that quantify the changes between the original and counterfactual images. These importance scores can be aggregated into global counterfactual explanations that highlight the overall features driving model predictions. While this unification is possible for any counterfactual method, it is particularly computationally efficient for ours. We demonstrate the efficacy of our approach with the MNIST and CelebA datasets.
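
To make the idea in the abstract concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a hypothetical linear "decoder" (DECODER), a logistic "classifier" (CLF_W, CLF_B), and linear attribute predictors on the latent code (ATTR_W); the paper's actual generative model, loss, step size, and stopping rule are not given in this excerpt. The sketch runs gradient ascent on the latent code until the target prediction is reached, reports per-attribute changes between the original and counterfactual latents, and averages them across instances as one plausible (assumed) form of global aggregation.

```python
import math

# Hypothetical toy stand-ins (assumptions, not from the paper).
DECODER = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # maps a 2-D latent to 3 "pixels"
CLF_W = [0.5, -0.25, 1.0]                        # logistic classifier weights on pixels
CLF_B = -1.0                                     # logistic classifier bias
ATTR_W = [[1.0, 0.0], [0.0, 1.0]]                # linear attribute predictors on the latent


def decode(z):
    """Map a latent code to image space with the toy linear decoder."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in DECODER]


def predict(z):
    """Probability of the target class for the decoded latent."""
    logit = sum(w * xi for w, xi in zip(CLF_W, decode(z))) + CLF_B
    return 1.0 / (1.0 + math.exp(-logit))


def latent_grad(z):
    """Gradient of log p(target | decode(z)) with respect to z.

    For a logistic classifier on a linear decoder this equals
    (1 - p) * DECODER^T CLF_W.
    """
    p = predict(z)
    direction = [
        sum(DECODER[i][j] * CLF_W[i] for i in range(len(CLF_W)))
        for j in range(len(z))
    ]
    return [(1.0 - p) * d for d in direction]


def counterfactual_attack(z0, threshold=0.9, step=0.5, max_iter=1000):
    """Gradient ascent in latent space until the target probability is reached."""
    z = list(z0)
    for _ in range(max_iter):
        if predict(z) >= threshold:
            break
        z = [zi + step * gi for zi, gi in zip(z, latent_grad(z))]
    return z


def attribute_changes(z0, z_cf):
    """Per-attribute change between original and counterfactual latents."""
    return [sum(w * (b - a) for w, a, b in zip(row, z0, z_cf)) for row in ATTR_W]


def global_attribution(all_changes):
    """Average per-attribute changes over instances (assumed aggregation)."""
    n = len(all_changes)
    return [sum(c[k] for c in all_changes) / n for k in range(len(all_changes[0]))]


starts = [[0.0, 0.0], [-1.0, 0.5]]
counterfactuals = [counterfactual_attack(z) for z in starts]
changes = [attribute_changes(z, zc) for z, zc in zip(starts, counterfactuals)]
summary = global_attribution(changes)
```

Because the attribute predictors act on the latent code directly, computing the changes needs no extra pass through the decoder or classifier, which is one way to read the efficiency remark in the abstract. In practice the decoder would be a trained generative model and the gradient would come from automatic differentiation rather than the closed form used here.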
