Extending Information Bottleneck Attribution to Video Sequences

Veronika Solopova, Lucas Schmidt, Dorothea Kolossa

TL;DR

This work extends the Information Bottleneck Attribution (IBA) framework to video data, forming Video Information Bottleneck Attribution (VIBA) to produce spatiotemporal explanations for video classifiers. It implements a dual-path architecture with Xception for spatial cues and a VGG11-based optical-flow model for motion, applying a bottleneck to control information flow and generate informative relevance maps. On a diverse deepfake dataset, VIBA delivers temporally and spatially coherent explanations with modest alignment to human annotations while preserving predictive performance and improving calibration. Overall, VIBA broadens information-theoretic attribution to temporal data and enables interpretable video analysis across architectures and tasks.

Abstract

We introduce VIBA, a novel approach for explainable video classification by adapting Information Bottlenecks for Attribution (IBA) to video sequences. While most traditional explainability methods are designed for image models, our IBA framework addresses the need for explainability in temporal models used for video analysis. To demonstrate its effectiveness, we apply VIBA to video deepfake detection, testing it on two architectures: the Xception model for spatial features and a VGG11-based model for capturing motion dynamics through optical flow. Using a custom dataset that reflects recent deepfake generation techniques, we adapt IBA to create relevance and optical flow maps, visually highlighting manipulated regions and motion inconsistencies. Our results show that VIBA generates temporally and spatially consistent explanations, which align closely with human annotations, thus providing interpretability for video classification and particularly for deepfake detection.
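The bottleneck mechanism that the abstract refers to can be illustrated with a minimal sketch. It follows the standard per-sample IBA formulation, in which a learned mask blends intermediate features with Gaussian noise and the per-element KL divergence serves as the relevance score. This is not the authors' implementation: the function names `bottleneck` and `relevance_map`, the `(frames, channels, height, width)` tensor layout, and the use of scalar feature statistics `mu` and `sigma` are all assumptions made for illustration.

```python
import numpy as np

def bottleneck(features, alpha, mu, sigma, rng):
    """Apply an IBA-style noise bottleneck to video features.

    features: array of shape (T, C, H, W), assumed layout (frames first).
    alpha:    mask logits, broadcastable to features; lambda = sigmoid(alpha).
    mu, sigma: mean and std of the features (scalars here for simplicity).
    Returns the noised features z and the per-element KL divergence in nats.
    """
    lam = 1.0 / (1.0 + np.exp(-alpha))            # mask values in (0, 1)
    eps = rng.normal(mu, sigma, size=features.shape)
    z = lam * features + (1.0 - lam) * eps        # lambda=1 passes signal, 0 passes noise

    # KL[ P(Z|R) || Q(Z) ] with Q(Z) = N(mu, sigma^2), computed on normalized features.
    r_norm = (features - mu) / sigma
    mu_z = lam * r_norm
    var_z = (1.0 - lam) ** 2
    kl = 0.5 * (mu_z ** 2 + var_z - np.log(var_z) - 1.0)
    return z, kl

def relevance_map(kl):
    """Sum KL over channels to get one relevance map per frame, in bits."""
    return kl.sum(axis=1) / np.log(2.0)
```

In a full attribution run, `alpha` would be optimized per video so that the classifier's loss on `z` stays low while the mean of `kl`, weighted by a trade-off parameter, is minimized. Regions where the mask stays open then carry the information the model needs, and `relevance_map` turns them into one heatmap per frame.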

Paper Structure

This paper contains 18 sections, 2 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Implemented VIBA pipeline.
  • Figure 2: Comparison of frequencies for human annotations, VGG and Xception VIBA matches for different face regions. Abbreviations: L: Lips & Mouth, E: Deepfake Edges, B: Brows, Eyes & Forehead, N: Nose, Cheeks & Ears, O: Outside of Face, C: Chin & Neck.
  • Figure 3: Pre-processing pipeline for optical flow maps.
  • Figure 4: Relevance maps generated with Xception and with bottleneck injection after block 4.
  • Figure 5: Relevance maps generated with Xception and with bottleneck injection after bn3.
  • ...and 6 more figures