MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations

Changlu Guo, Anders Nymark Christensen, Anders Bjorholm Dahl, Morten Rieger Hannemose

TL;DR

The proposed MaskDiME framework runs inference over 30x faster than the baseline method while matching or exceeding state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.

Abstract

Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model's prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, and effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. MaskDiME is training-free, runs inference over 30x faster than the baseline method, and matches or exceeds state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.

Paper Structure (13 sections, 9 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Previous methods such as DiME [jeanneret2022diffusion] and FastDiME [weng2024fast] produce global or scattered edits, whereas our MaskDiME achieves localized, decision-relevant modifications consistent with the classifier's saliency map, computed via SmoothGrad [smilkov2017smoothgrad].
  • Figure 2: Overview of MaskDiME. We illustrate a complete counterfactual generation process (No Smile → Smile) with $\tau = 60$, meaning that the diffusion starts from $z_{60} = \tilde{z}_{60}$ and $x_{60} = x$, where $z_t$ and $x_t$ denote the noisy and clean images at step $t$, respectively, and $\tilde{z}_t$ is obtained from the original forward diffusion process (Eq. \ref{eq:forward_diffusion}). Each step applies Gradient-Guided Denoising, modeled as $\mathcal{N}(\mu_\theta(z_t) - \Sigma_\theta(z_t)\nabla_{z_t},\, \Sigma_\theta(z_t))$, where the gradient $\nabla_{z_t}$ is derived from Eq. \ref{eq:gradients}. The adaptive masks $M_t^{x} \subseteq M_t^{z}$ are derived from classifier gradients on $x_t$: the top-$k\%$ and top-$\rho k\%$ gradient regions construct $M_t^{z}$ and $M_t^{x}$, respectively ($\rho \in (0,1]$).
  • Figure 3: Heatmap visualization of diffusion trajectories with different masking strategies. Each column shows noisy samples $z_t$ at different timesteps, and each row corresponds to a masking method. The heatmap represents the per-pixel update magnitude during each reverse step (red indicates stronger updates, blue indicates weaker ones). Our Adaptive Dual-mask yields the most focused and semantically consistent updates across diffusion steps.
  • Figure 4: Comparison of methods on the CelebA smile attribute by FID (from Table \ref{tab:celeba_results}) and runtime (batch size = 5). The area of each circle indicates the peak GPU memory allocated during sampling. MaskDiME is significantly faster than previous methods while also achieving the lowest FID and sustaining low GPU memory usage, approximately one-tenth of that required by ACE and RCSB. See Supplementary Tab. 6 for quantitative results.
  • Figure 5: Qualitative results. Compared with ACE $l_1$, MaskDiME effectively preserves the overall image structure and produces more pronounced counterfactual explanations, with superior performance in semantic consistency, visual realism, and modification precision.
  • ...and 1 more figures
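
The mask construction and guided mean described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`topk_mask`, `dual_masks`, `guided_mean`) are hypothetical, and the saliency definition (absolute gradient summed over channels, channel-first layout) is an assumption, since the caption only says the masks come from the top-$k\%$ and top-$\rho k\%$ gradient regions. The sketch shows only how thresholding the same ranking at two levels yields nested masks $M_t^{x} \subseteq M_t^{z}$, and how the mean of the guided denoising step is shifted; it does not model how the masks are applied during sampling.

```python
import numpy as np


def topk_mask(saliency, percent):
    """Boolean mask marking the top `percent`% of pixels by saliency value."""
    flat = saliency.ravel()
    n_keep = max(1, int(round(flat.size * percent / 100.0)))
    # Stable sort so ties are broken the same way for every threshold,
    # which guarantees that a smaller top-k set is a prefix of a larger one.
    order = np.argsort(-flat, kind="stable")
    mask = np.zeros(flat.size, dtype=bool)
    mask[order[:n_keep]] = True
    return mask.reshape(saliency.shape)


def dual_masks(grad_x, k, rho):
    """Build (M_z, M_x) from a classifier gradient on x_t.

    grad_x: array of shape (C, H, W) or (H, W)  (layout is an assumption)
    k:      percentage of pixels kept in M_z
    rho:    ratio in (0, 1]; M_x keeps the top rho*k percent
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    saliency = np.abs(grad_x)
    if saliency.ndim == 3:
        # Assumed reduction: sum absolute gradients over channels.
        saliency = saliency.sum(axis=0)
    m_z = topk_mask(saliency, k)
    m_x = topk_mask(saliency, rho * k)
    return m_z, m_x


def guided_mean(mu, sigma, grad_z):
    """Mean of the guided step: mu_theta(z_t) - Sigma_theta(z_t) * gradient.

    Sigma is treated as a per-pixel (diagonal) variance, hence the
    elementwise product.
    """
    return mu - sigma * grad_z
```

For example, with a gradient of shape `(3, 8, 8)`, `dual_masks(grad, k=25, rho=0.5)` returns two `(8, 8)` masks covering 16 and 8 pixels, with the smaller mask contained in the larger one.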