LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations

Payal Varshney; Adriano Lucieri; Christoph Balada; Sheraz Ahmed; Andreas Dengel

TL;DR

LD-ViCE introduces a latent-diffusion framework for video counterfactual explanations that is explicitly guided by the target model, enabling temporally coherent and semantically meaningful edits with reduced computational cost. It encodes videos into latent space via a 3D causal VAE, performs classifier-guided diffusion, and refines outputs to suppress artifacts, yielding counterfactuals that align with the model’s decision boundary. Across EchoNet-Dynamic, FERV39k, and Something-Something V2, LD-ViCE achieves state-of-the-art regression and competitive classification performance, with the refinement stage enhancing perceptual quality and realism. The approach advances trust and interpretability in video AI by providing actionable, temporally coherent explanations while highlighting trade-offs between accuracy and visual fidelity in high-stakes domains.

Abstract

Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, which reduces their semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational cost of generating explanations by operating in latent space with a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Experiments on three diverse video datasets, EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition), with multiple target models covering both classification and regression tasks demonstrate that LD-ViCE generalizes well and achieves state-of-the-art performance. On the EchoNet-Dynamic dataset, LD-ViCE achieves significantly higher regression accuracy than prior methods and exhibits high temporal consistency, while the refinement stage further improves perceptual quality. Qualitative analyses confirm that LD-ViCE produces semantically meaningful and temporally coherent explanations, providing actionable insights into model behavior. LD-ViCE advances the trustworthiness and interpretability of video-based AI systems through visually coherent counterfactual explanations.
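To make the idea of guidance from the target model concrete, the following is a toy sketch and not LD-ViCE's implementation. It assumes a hypothetical linear regressor in place of the target model and a plain vector in place of the VAE latent, and it omits the diffusion denoiser and the refinement stage entirely. All names (`guided_latent_update`, `prox`, `step`) are illustrative. The sketch only shows how the gradient of a target model's output can steer a latent toward a desired prediction while a proximity term keeps it close to the original.

```python
import numpy as np


def guided_latent_update(z0, w, b, target, step=0.1, prox=0.1, n_steps=200):
    """Steer a latent toward a target prediction while staying near z0.

    Toy stand-in for target-model guidance: the "model" is the linear
    regressor f(z) = w @ z + b, and the objective being minimized is
    (f(z) - target)^2 + prox * ||z - z0||^2.
    """
    z = z0.copy()
    for _ in range(n_steps):
        pred = w @ z + b
        # Gradient of the objective with respect to z.
        grad = 2.0 * (pred - target) * w + 2.0 * prox * (z - z0)
        z = z - step * grad
    return z


# Hypothetical 8-dimensional latent and a unit-norm regressor weight.
z0 = np.linspace(-1.0, 1.0, 8)
w = np.ones(8) / np.sqrt(8.0)
b = 0.0

pred_orig = float(w @ z0 + b)
target = pred_orig + 1.0  # ask for a prediction one unit higher

z_cf = guided_latent_update(z0, w, b, target)
pred_cf = float(w @ z_cf + b)

print("original prediction:      ", pred_orig)
print("counterfactual prediction:", pred_cf)
print("latent change (L2 norm):  ", float(np.linalg.norm(z_cf - z0)))
```

The proximity weight `prox` trades off how far the prediction moves against how much the latent changes, loosely mirroring the trade-off between accuracy and visual fidelity noted in the TL;DR.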
