Global Counterfactual Directions

Bartlomiej Sobieski; Przemysław Biecek

TL;DR

Counterfactual explanations are usually local: each one explains a single image. This work shows that the semantic latent space of a Diffusion Autoencoder (DiffAE) contains global directions that flip a classifier's decision across an entire dataset. It introduces a proxy-based method that discovers two types of such directions, g-directions and h-directions, from a single image in a black-box setting, and shows that these directions transfer across datasets. Combining Global Counterfactual Directions (GCDs) with Latent Integrated Gradients yields BB-LIG, a black-box attribution method that highlights the image regions influencing the classifier and makes the counterfactuals easier to interpret. Experiments on CelebA, CelebA-HQ, and CheXpert show strong globality for g-directions and added diversity from h-directions, with results competitive with white-box methods and robust to out-of-domain data. The work also provides metrics, ablations, and code, and leaves extensions to other generative models and attribution tasks as future work.

Abstract

Despite increasing progress in the development of methods for generating visual counterfactual explanations, especially with the recent rise of Denoising Diffusion Probabilistic Models, previous works treat them as an entirely local technique. In this work, we take the first step toward globalizing them. Specifically, we discover that the latent space of Diffusion Autoencoders encodes the inference process of a given classifier in the form of global directions. We propose a novel proxy-based approach that discovers two types of these directions using only a single image in an entirely black-box manner. Precisely, g-directions allow for flipping the decision of a given classifier on an entire dataset of images, while h-directions further increase the diversity of explanations. We refer to them collectively as Global Counterfactual Directions (GCDs). Moreover, we show that GCDs can be naturally combined with Latent Integrated Gradients, resulting in a new black-box attribution method that also enhances the understanding of counterfactual explanations. We validate our approach on existing benchmarks and show that it generalizes to real-world use cases.
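To make the notion of a global direction concrete, the following is a minimal sketch, not the authors' implementation. It assumes a latent code is a plain list of floats, that a counterfactual is obtained by adding a scaled direction to that code, and that the classifier is queried only through its output probability. The names `shift_latent`, `flip_fraction`, and `toy_predict` are hypothetical, and `toy_predict` is a stand-in for "decode the latent with the DiffAE, then classify the image".

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def shift_latent(z: Sequence[float], direction: Sequence[float], step: float) -> Vector:
    """Move a latent code along a single direction by the given step size."""
    return [zi + step * di for zi, di in zip(z, direction)]


def flip_fraction(
    latents: Sequence[Sequence[float]],
    direction: Sequence[float],
    step: float,
    predict: Callable[[Sequence[float]], float],
) -> float:
    """Fraction of latents whose thresholded decision changes after the shift.

    `predict` is treated as a black box: only its output probability is used.
    """
    flipped = 0
    for z in latents:
        before = predict(z) >= 0.5
        after = predict(shift_latent(z, direction, step)) >= 0.5
        flipped += int(before != after)
    return flipped / len(latents)


def toy_predict(z: Sequence[float]) -> float:
    """Assumption: a toy logistic score standing in for decoder + classifier."""
    return 1.0 / (1.0 + math.exp(-(z[0] - z[1])))


# The same direction is applied to every latent, which is what "global" means here.
latents = [[2.0, 0.0], [1.0, 0.5], [0.5, -1.0], [-1.0, 0.0]]
direction = [-1.0, 1.0]
print(flip_fraction(latents, direction, step=1.5, predict=toy_predict))
```

In this toy setup the direction only lowers the score, so latents that already sit on the negative side never flip; a direction that flips decisions from both sides would need a sign or target-class convention that this sketch does not model.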

Paper Structure (36 sections, 10 equations, 19 figures, 11 tables)

Introduction
Related work
Background
Visual counterfactual explanations
Diffusion models
Integrated Gradients
Method
Experiments
Evaluation of g-directions
Evaluation of h-directions
Understanding counterfactual explanations
Conclusions
Personal data / human subjects
Appendix overview
Pseudocode
...and 21 more sections

Figures (19)

Figure 1: Conceptual summary of the introduced method. We first encode the source image and create local perturbations of its semantic latent representation. By generating images from these perturbations and passing them through LPIPS and the target classifier, we obtain training data for the proxy network, which locally approximates the counterfactual loss. Using the trained proxy, we discover g- and h-directions, which we term Global Counterfactual Directions, as a single direction allows for generating counterfactual explanations for an entire dataset of images.
Figure 1: Absolute differences between original images and their CEs resulting from a class-specific (indicated above) g-direction, averaged over a set of $128$ images.
Figure 2: (Left) Factual (top) and counterfactual (bottom) images from a single g-direction for the age class on CelebA-HQ. Semantic changes vary from image to image. (Right) Influence of g-directions resulting from different source images for age (top) and smile (bottom). Because they induce different semantic changes, they can act as a source of diversity on their own.
Figure 2: Predicted probability of the target classifier with respect to step size for each class on the CelebA datasets. For simplicity, all probabilities are converted so that the original prediction is at least 0.5. That is, if the classifier's initial prediction $y$ is smaller than 0.5, we convert it to $1 - y$; otherwise, we leave it unchanged.
Figure 3: Globality of the h-directions. For each dataset-class pair, FR of each direction on a set of 128 images is plotted. The behavior varies across all considered cases.
...and 14 more figures
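
Two of the captions above describe simple computations: the probability conversion used for the step-size plot, and the averaged absolute difference between originals and their CEs. The sketch below illustrates both under stated assumptions: images are represented as flat lists of pixel values, and the function names `fold_probability` and `mean_abs_difference` are hypothetical rather than taken from the authors' code.

```python
from typing import List, Sequence


def fold_probability(y: float) -> float:
    """Convert a prediction so that it is at least 0.5, as the caption describes."""
    return 1.0 - y if y < 0.5 else y


def mean_abs_difference(
    originals: Sequence[Sequence[float]],
    counterfactuals: Sequence[Sequence[float]],
) -> List[float]:
    """Per-pixel absolute difference between paired images, averaged over the set.

    Assumption: each image is a flat list of pixel values of equal length.
    """
    count = len(originals)
    totals = [0.0] * len(originals[0])
    for image, counterfactual in zip(originals, counterfactuals):
        for i, (a, b) in enumerate(zip(image, counterfactual)):
            totals[i] += abs(a - b)
    return [total / count for total in totals]


# Two tiny "images" of two pixels each, paired with their counterfactuals.
originals = [[0.0, 1.0], [1.0, 1.0]]
counterfactuals = [[0.5, 1.0], [0.0, 0.5]]
print(fold_probability(0.2), mean_abs_difference(originals, counterfactuals))
```

Averaging per pixel rather than per image keeps the spatial layout, which is what lets the resulting map be displayed as an image.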
