Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations

Chao Wang, Chengan Che, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera

TL;DR

The paper explains video classifiers by generating video counterfactual explanations (CFEs) with Back To The Feature (BTTF). It formulates a two-stage optimization over the latents of a latent diffusion model: an inversion stage that anchors the search near the original video, and a counterfactual generation stage guided by the target classifier, augmented with a style loss to keep the result realistic and a progressive denoising schedule for efficiency. BTTF produces high-quality, temporally coherent CFEs across motion, emotion, and action domains, revealing the classifier’s decision cues and uncovering spurious features in state-of-the-art models. Alongside these interpretability and debugging benefits, the authors note significant computational demands and call for future work on general-purpose generators and standardized video CFE metrics.

Abstract

Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of a model's input that alter the model's prediction. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods are designed to explain image classifiers. The generation of CFEs for video classifiers remains largely underexplored. For counterfactual videos to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing image-based CFE methods lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features: 1) an optimization scheme to retrieve the initial latent noise conditioned on the first frame of the input video, and 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier's decision-making mechanism.

Paper Structure

This paper contains 27 sections, 13 equations, 10 figures, 5 tables, 2 algorithms.

Figures (10)

  • Figure 1: Video counterfactual explanations with BTTF. The top row shows an input video, where the facial expression is predicted as "Angry" with 98% confidence by the target video classifier E-swin. To answer "why angry not sad?", our method BTTF (middle row) introduces minimal and semantically meaningful changes to the input video, resulting in the alteration of the model's prediction to "Sad". Similarly, to answer "why angry not happy?", another counterfactual video classified as "Happy" is generated (bottom row). These counterfactual explanations visually highlight the key spatiotemporal features (facial movements) that the classifier relies on to make its decisions.
  • Figure 2: Illustration of the BTTF optimization framework for video CFEs. In the first stage (inversion), the latent input $z_T$, initially sampled from a Gaussian distribution, is optimized by the gradients backpropagated from the reconstruction loss $\mathcal{L}_I$ between the noise-free latent $\hat{z}_0$ and the original input video latent $z_i$. In the second stage (CFE generation), $\hat{z}_0$ is further decoded by the VAE decoder $\mathcal{D}$ to obtain the generated counterfactual video $\hat{x}_c$, which is then fed into the target video classifier (black box) to compute the cross-entropy loss with the target class $y_c$. The cross-entropy loss, together with the video style loss computed from $\hat{x}_c$ and $x_i$, constitutes the objective function $\mathcal{L}_c$, which is used to optimize $z_T$ via gradient backpropagation. In the denoising process of the I2V diffusion model, the number of inference steps is set to one in the first stage, while it progressively increases from one to $N$ in the second stage to accelerate convergence of the optimization process.
  • Figure 3: CFE videos generated by BTTF for the target motion classifier M-swin trained on Shape-Moving. BTTF changes M-swin's prediction on the original input video from "Up" to target motion classes "Left", "Down" and "Right", respectively, demonstrating that BTTF is capable of precisely editing pure dynamic features (here, movement directions) to produce CFE videos.
  • Figure 4: CFE videos generated by BTTF for the target emotion classifier E-swin trained on MEAD. BTTF alters E-swin's prediction on the original input video from "Neutral" to target emotion classes "Fear", "Contempt" and "Disgust", respectively. The results indicate the strong capacity of BTTF to edit emotion features in a semantically meaningful way.
  • Figure 5: CFE videos generated by BTTF for the target action classifier A-swinR trained on NTU RGB+D. BTTF flips A-swinR's prediction on the original input video from "Hand waving" to target action classes "Taking a selfie", "Pointing" and "Staggering", respectively. The results demonstrate that BTTF's edits to human actions are physically plausible.
  • ...and 5 more figures
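
The two-stage procedure described in the Figure 2 caption can be illustrated with a deliberately small toy, which is a sketch under stated assumptions and not the authors' implementation. Everything concrete here is hypothetical: the "denoiser" is a plain scaling by `a**n_steps` instead of an I2V diffusion model, the decoder is the identity, the classifier is a two-class logistic model with weights `w`, a squared distance to `z_i` stands in for the video style loss, and `step_schedule` is one possible way to grow the number of steps from one to `n_max`. Only the overall shape follows the caption: stage 1 fits $z_T$ with a single denoising step, and stage 2 continues optimizing the same $z_T$ against a classifier loss plus a similarity term while the step count increases.

```python
import math

def denoise(z, n_steps, a=0.9):
    """Toy stand-in for an n-step denoiser: scales z by a**n_steps."""
    scale = a ** n_steps
    return [scale * v for v in z], scale

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(z0, w):
    """Toy classifier: probability of class 1 under a logistic model."""
    return sigmoid(sum(wi * v for wi, v in zip(w, z0)))

def step_schedule(it, n_iters, n_max):
    """Hypothetical schedule: equal-length phases with 1, 2, ..., n_max steps."""
    return 1 + (it * n_max) // n_iters

def invert(z_i, n_iters=200, lr=0.5):
    """Stage 1: fit z_T so that a one-step denoise reconstructs z_i."""
    # Zeros keep the toy deterministic; the paper samples Gaussian noise.
    z_T = [0.0] * len(z_i)
    for _ in range(n_iters):
        z0, s = denoise(z_T, 1)
        # Gradient of sum((z0 - z_i)**2) with respect to z_T.
        grad = [2.0 * s * (v - t) for v, t in zip(z0, z_i)]
        z_T = [z - lr * g for z, g in zip(z_T, grad)]
    return z_T

def counterfactual(z_T, z_i, w, target, n_max=4, n_iters=300, lr=1.0, lam=0.1):
    """Stage 2: push the prediction toward `target` while staying near z_i."""
    for it in range(n_iters):
        n = step_schedule(it, n_iters, n_max)
        z0, s = denoise(z_T, n)
        p = predict(z0, w)
        # Loss: binary cross-entropy(p, target) + lam * sum((z0 - z_i)**2).
        # d(BCE)/d(z0) = (p - target) * w; the chain rule adds the factor s.
        grad = [s * ((p - target) * wi + 2.0 * lam * (v - t))
                for wi, v, t in zip(w, z0, z_i)]
        z_T = [z - lr * g for z, g in zip(z_T, grad)]
    return z_T, denoise(z_T, n_max)[0]

if __name__ == "__main__":
    z_i = [1.0, -2.0]          # toy "input video latent"
    w = [-1.0, 0.5]            # toy classifier weights
    z_T = invert(z_i)
    _, z_c = counterfactual(z_T, z_i, w, target=1.0)
    print("p(class 1) before:", predict(z_i, w))
    print("p(class 1) after: ", predict(z_c, w))
```

Starting stage 2 from the inverted $z_T$ is what keeps the search close to the input: the similarity weight `lam` then trades off how far the counterfactual may drift against how confidently the toy classifier switches class.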