DiG-IN: Diffusion Guidance for Investigating Networks -- Uncovering Classifier Differences, Neuron Visualisations, and Visual Counterfactual Explanations

Maximilian Augustin, Yannic Neuhaus, Matthias Hein

TL;DR

DiG-IN addresses reliability and interpretability challenges of image classifiers with a training-free, diffusion-guided framework that optimizes the inputs of a latent diffusion model to produce realistic images for analysis. It unifies three analysis tasks (classifier disagreement, universal visual counterfactual explanations, and neuron activation visualizations) in a plug-and-play approach that applies to arbitrary image classifiers. The diffusion-guided VCEs outperform prior approaches in realism and semantic fidelity, the generated images reveal biases such as the shape bias of adversarially robust models and systematic errors of zero-shot CLIP classifiers, and the framework provides quantitative tools to distinguish core from spurious neuron features. Together, these make DiG-IN a practical toolkit for debugging and validating vision models in safety-critical or real-world settings.

Abstract

While deep learning has led to huge progress in complex image classification tasks like ImageNet, unexpected failure modes, e.g. via spurious features, call into question how reliably these classifiers work in the wild. Furthermore, for safety-critical tasks the black-box nature of their decisions is problematic, and explanations or at least methods which make decisions plausible are needed urgently. In this paper, we address these problems by generating images that optimize a classifier-derived objective using a framework for guided image generation. We analyze the decisions of image classifiers by visual counterfactual explanations (VCEs), detection of systematic mistakes by analyzing images where classifiers maximally disagree, and visualization of neurons and spurious features. In this way, we validate existing observations, e.g. the shape bias of adversarially robust models, as well as novel failure modes, e.g. systematic errors of zero-shot CLIP classifiers. Moreover, our VCEs outperform previous work while being more versatile.

Paper Structure

This paper contains 36 sections, 23 equations, 36 figures, and 2 tables.

Figures (36)

  • Figure 1: Illustration of three tasks for debugging image classifiers. Left: we generate images where one classifier is highly confident in a class and the other is not, and recover the shape bias of adversarially robust models compared to a standard model; Middle: we generate images that maximize or minimize the activation of a neuron. We identify one neuron labeled as spurious for "fiddler crab" in [singla2021salient] as associated with sand; Right: we produce visual counterfactual explanations for arbitrary image classifiers and outperform [augustin2022diffusion].
  • Figure 2: Illustration of the forward diffusion process (black arrows) from the initial latent $z_T$ into the loss function $L$ and the gradient flow during backpropagation (purple arrows). The optimization variables $z_T, (\varnothing_t)_{t=1}^T$ and $(C_t)_{t=1}^T$ are marked with a dashed border. On the left, we illustrate the conditioning mechanism inside the denoising U-Net via cross-attention (XA) layers.
  • Figure 3: Classifier disagreement: shape bias of adversarially robust models. For a given class $y$, the first row shows the output of Stable Diffusion for "a photograph of $y$". The images in the second row have been optimized to maximize the confidence of an adversarially robust ViT-S while minimizing that of a standard ViT-S. The resulting images retain the same shape but have smooth surfaces and little texture.
  • Figure 4: Classifier disagreement: images maximizing the disagreement between two classifiers $f$ and $g$ can reveal biases and failure modes of one or both classifiers. We observe three variants. For the shape bias of robust models, the generated subpopulation has a schematic appearance but is still part of the true class (left). The zero-shot CLIP classifier extends the original class to a much larger set of out-of-distribution samples, which causes unexpected failure modes (middle). When comparing the ViT and the ConvNeXt models, we find different biases by generating images inside as well as outside of the true class (right).
  • Figure 5: Detection of errors of the zero-shot CLIP model (ImageNet): we generate an SD image with the prompt "a photograph of <CLASSNAME>". Starting from this image, we maximize the difference between the target-class probability predicted by a zero-shot CLIP ImageNet model and that predicted by a ConvNeXt-B trained on ImageNet (first row). We find subpopulations of images that are systematically misclassified by the CLIP model: waffles are classified as "waffle iron", stone bridges as "steel arch bridges", spoons on a wooden table as "wooden spoon", and images with space and bar as "space bar". In the second row, we validate these errors by finding similar real images in LAION-5B (see App. \ref{app:CLIP}). The errors of CLIP are most likely an artefact of the text embeddings due to the composition of the class name.
  • ...and 31 more figures
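
The classifier-disagreement objective described in the captions of Figures 3 to 5 can be illustrated with a minimal sketch. It assumes the score is the plain difference between the target-class probabilities of two classifiers $f$ and $g$; the paper's exact loss may differ. The function names and the example logits are hypothetical, and the diffusion model that would generate the image and receive the gradients is omitted entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def disagreement(logits_f, logits_g, target):
    """Assumed objective: p_f(target | x) - p_g(target | x).

    A large positive value means classifier f is confident in the
    target class for this image while classifier g is not.
    """
    return softmax(logits_f)[target] - softmax(logits_g)[target]


# Hypothetical logits of two classifiers on the same generated image.
logits_f = [4.0, 0.5, 0.1]
logits_g = [0.2, 3.0, 0.1]

score = disagreement(logits_f, logits_g, target=0)
print(f"disagreement for class 0: {score:.3f}")
```

In a guided-generation setting, a score of this kind would be maximized with respect to the diffusion inputs (the variables marked in Figure 2) rather than evaluated on fixed logits as here.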