DEPICT: Diffusion-Enabled Permutation Importance for Image Classification Tasks

Sarah Jabbour; Gregory Kondas; Ella Kazerooni; Michael Sjoding; David Fouhey; Jenna Wiens

TL;DR

DEPICT addresses the need for global, concept-level explanations of image classifiers by enabling permutation-based explanations in the concept space. It uses a text space $\mathcal{T}$ of describable concepts, a diffusion generator $g: \mathcal{T} \to \mathcal{I}$, and a concept detector $h: \mathcal{I} \to \mathcal{T}$ to permute concepts across images and measure the resulting drop in downstream performance. Concepts are ranked by comparing the classifier's performance $a$ on unpermuted data with its performance $a_j$ after permuting concept $j$. Across synthetic data, COCO, and MIMIC-CXR, DEPICT consistently outperforms traditional instance-based methods like GradCAM and LIME in recovering ground-truth feature importances, while validating its two key assumptions: effective generation and independent permutation. The approach enables practical, dataset-level insights into which concepts image classifiers rely on, with potential impact in safety-critical domains such as healthcare.
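
The loop described above can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than the paper's implementation: `g` replaces the diffusion generator with "concept vector plus noise", `f` is a hand-written classifier, binary concept vectors play the role of captions, and the concept detector $h$ is omitted. Reading $a$ as performance on images generated from unpermuted captions and $a_j$ as performance after permuting concept $j$ is also an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(concepts):
    """Hypothetical stand-in for the generator g: T -> I.

    A real pipeline would call a text-conditioned diffusion model; here an
    "image" is simply the concept vector plus a little pixel noise.
    """
    return concepts + 0.05 * rng.standard_normal(concepts.shape)

def f(images):
    """Toy classifier under study: predicts 1 when concept 0 is present
    together with concept 1 or concept 2; it ignores concept 3."""
    score = 2.0 * images[:, 0] + images[:, 1] + images[:, 2]
    return (score > 2.5).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# Binary concept annotations play the role of captions in the text space.
n_images, n_concepts = 2000, 4
captions = rng.integers(0, 2, size=(n_images, n_concepts)).astype(float)
labels = f(captions)  # ground truth defined by the noise-free concepts

# Reference performance a on images generated from unpermuted captions.
a = accuracy(labels, f(g(captions)))

drops = []
for j in range(n_concepts):
    permuted = captions.copy()
    permuted[:, j] = rng.permutation(permuted[:, j])  # shuffle concept j only
    a_j = accuracy(labels, f(g(permuted)))
    drops.append(a - a_j)

ranking = np.argsort(drops)[::-1]  # most important concept first
```

In this toy setting, concept 0 should rank first because the classifier cannot predict the positive class without it, while permuting the ignored concept 3 should leave performance essentially unchanged.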

Abstract

We propose a permutation-based explanation method for image classifiers. Current image-model explanations like activation maps are limited to instance-based explanations in the pixel space, making it difficult to understand global model behavior. In contrast, permutation-based explanations for tabular data classifiers measure feature importance by comparing model performance on data before and after permuting a feature. We propose an explanation method for image-based models that permutes interpretable concepts across dataset images. Given a dataset of images labeled with specific concepts, such as captions, we permute a concept across examples in the text space and then generate images via a text-conditioned diffusion model. Feature importance is then reflected by the change in model performance relative to unpermuted data. When applied to a set of concepts, the method generates a ranking of feature importance. We show that this approach recovers underlying model feature importance on synthetic and real-world image classification tasks.
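
For readers unfamiliar with the tabular baseline the abstract refers to, here is a minimal sketch of permutation importance on tabular data. The function name `permutation_importance`, its parameters, and the toy linear classifier are illustrative assumptions, not code from the paper.

```python
import numpy as np

def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Importance of feature j = baseline score minus the mean score
    obtained after shuffling column j across examples."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature-label link
            scores.append(metric(y, predict(X_perm)))
        importances[j] = baseline - np.mean(scores)
    return importances

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def predict(X):
    # Toy model: relies heavily on feature 0, weakly on feature 1,
    # and not at all on feature 2.
    return (3.0 * X[:, 0] + 1.0 * X[:, 1] > 0).astype(int)

data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((1000, 3))
y = predict(X)  # labels taken from the model itself, so the baseline is 1.0

imp = permutation_importance(predict, X, y, accuracy)
```

Because the toy model never reads feature 2, shuffling that column cannot change any prediction, so its importance is exactly zero; features 0 and 1 receive positive importances ordered by how strongly the model depends on them.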

Paper Structure (19 sections, 18 figures, 9 tables)

Introduction
Related Works
Method
Permutation Importance on Tabular Data
Permutation Importance on Image Data
Experiments & Results
Synthetic Dataset
Real Dataset
DEPICT in Practice: A Case Study in Healthcare
Limitations
Conclusion
Supplementary Materials Overview
Synthetic Validation
Experiments
COCO
...and 4 more sections

Figures (18)

Figure 1: Text-conditioned diffusion enables permutation importance for images. Given images captioned with concepts, we permute concepts across captions. Then, we generate images via text-conditioned diffusion models and measure classifier performance relative to unpermuted data. If performance drops, the model relies on the concept.
Figure 2: Approach overview. In tabular permutation importance (left), one obtains feature importance by permuting each feature column and measuring the impact on model performance. In diffusion-enabled image permutation importance (right), features are permuted in the diffusion model's conditioning text space, and dataset images are then generated for classifier evaluation. To validate results, one can check that the model accurately classifies generated images and that only the permuted concept changed.
Figure 3: Model feature importance across synthetic data models. We compare the DEPICT ranking to GradCAM (Selvaraju et al., 2017) and LIME (Ribeiro et al., 2016). Left: DEPICT correlates more strongly with the standardized regression weights than GradCAM and LIME do. Right: rankings generated for 4 of the 100 classifiers, chosen at random. RC: red circle; BC: blue circle; GC: green circle; RR: red rectangle; BR: blue rectangle; GR: green rectangle.
Figure 4: AUROC and top-k accuracy of methods across varying importance thresholds. We plot DEPICT's performance against that of GradCAM and LIME. Data points in the upper left half indicate DEPICT outperforming GradCAM and LIME, while those in the lower half indicate DEPICT underperforming. Across all three sets of tasks, DEPICT outperforms both GradCAM and LIME in terms of AUROC and top-k accuracy when predicting important concepts across most thresholds.
Figure 5: Generated Images. Examples of generated images where each concept is (upper) or is not (lower) in the caption used to generate the image. The generated images reflect whether or not the concept is included in the caption.
...and 13 more figures
