LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations

Payal Varshney, Adriano Lucieri, Christoph Balada, Sheraz Ahmed, Andreas Dengel

TL;DR

LD-ViCE introduces a latent-diffusion framework for video counterfactual explanations that is explicitly guided by the target model, enabling temporally coherent and semantically meaningful edits with reduced computational cost. It encodes videos into latent space via a 3D causal VAE, performs classifier-guided diffusion, and refines outputs to suppress artifacts, yielding counterfactuals that align with the model’s decision boundary. Across EchoNet-Dynamic, FERV39k, and Something-Something V2, LD-ViCE achieves state-of-the-art regression and competitive classification performance, with the refinement stage enhancing perceptual quality and realism. The approach advances trust and interpretability in video AI by providing actionable, temporally coherent explanations while highlighting trade-offs between accuracy and visual fidelity in high-stakes domains.

Abstract

Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational costs of generating explanations by operating in latent space using a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Experiments on three diverse video datasets, EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition), with multiple target models covering both classification and regression tasks demonstrate that LD-ViCE generalizes well and achieves state-of-the-art performance. On the EchoNet-Dynamic dataset, LD-ViCE achieves significantly higher regression accuracy than prior methods and exhibits high temporal consistency, while the refinement stage further improves perceptual quality. Qualitative analyses confirm that LD-ViCE produces semantically meaningful and temporally coherent explanations, providing actionable insights into model behavior. LD-ViCE advances the trustworthiness and interpretability of video-based AI systems through visually coherent counterfactual explanations.

Paper Structure

This paper contains 42 sections, 16 equations, 29 figures, 8 tables.

Figures (29)

  • Figure 1: Qualitative counterfactual results generated by LD-ViCE on the FERV39k dataset. The first row shows four frames from three original videos predicted as Surprise. The second, third, fourth, and fifth rows display counterfactuals generated for the target emotion classes Angry, Fear, Happy, and Sad, respectively. The generated counterfactuals exhibit distinct and class-consistent facial dynamics corresponding to the desired emotional categories.
  • Figure 2: Overview of the LD-ViCE counterfactual generation process. The factual video $x_f$ is encoded and perturbed to obtain the noisy latent $z_T$ (here, $T=3$), while the conditional text prompt $c$ is embedded via the text encoder $\tau_\delta(c)$. At each guided denoising step $t$, the latent $z_t$ and embedding $\tau_\delta(c)$ are provided to the diffusion model. The denoising model (e.g., Expert Transformer (Ex-Tr)) predicts the noise $\hat{\epsilon}$, which is used in the sampling process to compute the clean latent $v_t$ and the less noisy latent $\tilde{z}_{t-1}$. The clean latent $v_t$ is decoded to produce $\tilde{x}_t$, which is used to estimate classifier gradients, scaled by $\lambda_c$, to compute the updated latent $z_{t-1}$. After the final step, $z_0$ is decoded into the counterfactual video $x_{cf}$. A refinement stage then denoises the same latent $z_T$ without guidance to obtain a clean reference video, from which a mask is computed to suppress diffusion artifacts and produce the final masked counterfactual video $x_{mcf}$.
  • Figure 3: Qualitative comparison of counterfactual explanations on the EchoNet-Dynamic dataset. The first row shows eight frames from the original video, while the subsequent rows present counterfactuals generated using LD-ViCE, LD-ViCE-RA, and 1SCM [reynaud2023feature], respectively. Predicted LVEF values are shown on the left. The figure illustrates that LD-ViCE produces visually coherent counterfactuals that match the target regression values more closely than prior methods.
  • Figure 4: Qualitative comparison of counterfactuals generated by LD-ViCE and its RA variant on the FERV39k dataset. The figure shows three representative samples, each consisting of four frames from the original video (top row) and the counterfactuals generated by LD-ViCE and LD-ViCE-RA, together with their difference maps. Original and target emotion classes are indicated on the left side of each example. The LD-ViCE-RA variant focuses more precisely on expression-relevant facial regions while maintaining realistic appearance and temporal coherence.
  • Figure 5: Qualitative counterfactual results on the EchoNet-Dynamic dataset. Eight frames of a video are displayed. Factual frames and their guidance-free denoised versions are shown with difference maps (Denoised-Diff. Map) visualizing diffusion-induced changes. Classifier-guided counterfactuals generated by LD-ViCE introduce targeted adjustments, highlighted in the difference maps (Diff. Map). The RA variant suppresses high-frequency artifacts, yielding cleaner counterfactuals, with difference maps (RA-Diff. Map) showing only the salient, causal changes.
  • ...and 24 more figures
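
The guided denoising step and the mask-based refinement described in the Figure 2 caption can be illustrated with a toy sketch in plain Python. It is not the authors' implementation, and everything concrete in it is an assumption made for illustration: the deterministic DDIM-style update in `ddim_step`, the stand-in `decode` (a fixed scaling instead of the 3D causal VAE decoder), the stand-in `predict` (a mean instead of a real target model), the squared-error guidance loss, and the thresholding rule in `masked_counterfactual`. Only the symbols $z_t$, $\hat{\epsilon}$, $v_t$, $\tilde{z}_{t-1}$, $\lambda_c$, $x_{cf}$, and $x_{mcf}$ come from the caption.

```python
import math
from typing import List, Tuple

Vec = List[float]  # a flattened latent or video, kept as a plain list for the toy example


def ddim_step(z_t: Vec, eps_hat: Vec, abar_t: float, abar_prev: float) -> Tuple[Vec, Vec]:
    """Assumed deterministic DDIM-style update; the paper's sampler may differ.

    Returns the clean-latent estimate v_t and the less noisy latent z~_{t-1}.
    """
    v_t = [(z - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
           for z, e in zip(z_t, eps_hat)]
    z_tilde_prev = [math.sqrt(abar_prev) * v + math.sqrt(1.0 - abar_prev) * e
                    for v, e in zip(v_t, eps_hat)]
    return v_t, z_tilde_prev


def decode(v: Vec) -> Vec:
    """Hypothetical decoder stand-in: scales every latent value by 2."""
    return [2.0 * x for x in v]


def predict(x: Vec) -> float:
    """Hypothetical target model stand-in: the mean of the decoded values."""
    return sum(x) / len(x)


def loss_grad_wrt_latent(v: Vec, target: float) -> Vec:
    """Analytic gradient of (predict(decode(v)) - target)^2 with respect to v.

    For the toy decoder and model above, d predict(decode(v)) / d v_i = 2 / len(v).
    """
    residual = predict(decode(v)) - target
    return [2.0 * residual * (2.0 / len(v)) for _ in v]


def guided_step(z_t: Vec, eps_hat: Vec, abar_t: float, abar_prev: float,
                target: float, lambda_c: float) -> Vec:
    """One guided step: z_{t-1} = z~_{t-1} - lambda_c * grad, with grad taken at v_t."""
    v_t, z_tilde_prev = ddim_step(z_t, eps_hat, abar_t, abar_prev)
    grad = loss_grad_wrt_latent(v_t, target)
    return [z - lambda_c * g for z, g in zip(z_tilde_prev, grad)]


def masked_counterfactual(x_f: Vec, x_cf: Vec, x_ref: Vec, threshold: float) -> Vec:
    """Assumed masking rule: keep a counterfactual value only where it differs from
    the guidance-free reference x_ref by more than `threshold`; otherwise fall back
    to the factual value, which discards changes the diffusion process alone made."""
    return [cf if abs(cf - ref) > threshold else f
            for f, cf, ref in zip(x_f, x_cf, x_ref)]


if __name__ == "__main__":
    z_t = [0.5, -0.2, 0.1, 0.4]
    eps_hat = [0.1, 0.0, -0.1, 0.05]  # would come from the denoising model
    unguided = guided_step(z_t, eps_hat, 0.5, 0.8, target=1.0, lambda_c=0.0)
    guided = guided_step(z_t, eps_hat, 0.5, 0.8, target=1.0, lambda_c=0.5)
    print("unguided prediction:", predict(decode(unguided)))
    print("guided prediction:  ", predict(decode(guided)))
```

With `lambda_c = 0` the step reduces to the plain sampler update, which corresponds to the guidance-free pass used to obtain the reference video in the refinement stage. A real pipeline would obtain the gradient by backpropagating through the decoder and the target model rather than from a closed-form expression.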