Exploring Explainability in Video Action Recognition

Avinab Saha, Shashank Gupta, Sravan Kumar Ankireddy, Karl Chahine, Joydeep Ghosh

TL;DR

The paper addresses the gap in explainability for video action recognition by extending Grad-CAM to video models and introducing Video-TCAV, a post-hoc framework that quantifies the influence of high-level concepts via Concept Activation Vectors (CAVs). Concepts are generated in two forms, spatial and spatiotemporal, using an automated YOLO-v7 pipeline with manual verification, and are evaluated on a Video Swin Transformer trained on Kinetics-400. Findings show that dynamic spatiotemporal concepts provide stronger, layer-dependent explanations than static spatial concepts, with results that remain statistically significant after Bonferroni correction. This work offers a scalable framework for hypothesis testing in action recognition and points to future directions, including broader model evaluations and diffusion-based concept generation to enhance explainability in video models.
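
As a rough illustration of the Bonferroni correction mentioned above: when several (concept, layer) hypotheses are tested at once, each p-value is compared against the significance level divided by the number of tests. The sketch below is only that, a sketch. The function name, the choice of `alpha = 0.05`, and the p-values are hypothetical and are not taken from the paper.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return one boolean per test: True where p < alpha / number_of_tests."""
    num_tests = len(p_values)
    if num_tests == 0:
        return []
    # Bonferroni: divide the family-wise significance level by the test count.
    threshold = alpha / num_tests
    return [p < threshold for p in p_values]


# Hypothetical p-values, one per (concept, layer) pair; not results from the paper.
example_p_values = [0.001, 0.02, 0.04]
flags = bonferroni_significant(example_p_values)
print(flags)
```

With three tests the per-test threshold is 0.05 / 3, so only p-values below roughly 0.0167 count as significant.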

Abstract

Image Classification and Video Action Recognition are perhaps the two most foundational tasks in computer vision. Consequently, explaining the inner workings of trained deep neural networks is of prime importance. While numerous efforts focus on explaining the decisions of trained deep neural networks in image classification, exploration in the domain of its temporal version, video action recognition, has been scant. In this work, we take a deeper look at this problem. We begin by revisiting Grad-CAM, one of the popular feature attribution methods for Image Classification, along with its extension to Video Action Recognition tasks, and examine the method's limitations. To address these, we introduce Video-TCAV, which builds on TCAV for Image Classification tasks and aims to quantify the importance of specific concepts in the decision-making process of Video Action Recognition models. As the scalable generation of concepts is still an open problem, we propose a machine-assisted approach to generate spatial and spatiotemporal concepts relevant to Video Action Recognition for testing Video-TCAV. We then establish the importance of temporally varying concepts by demonstrating the superiority of dynamic spatiotemporal concepts over trivial spatial concepts. In conclusion, we introduce a framework for investigating hypotheses in action recognition and quantitatively testing them, thus advancing research in the explainability of deep neural networks used in video action recognition.
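
To make the idea of quantifying concept importance more concrete, here is a minimal sketch of a TCAV-style computation, written under several assumptions. Layer activations and class-logit gradients are assumed to be already extracted and flattened into vectors. A least-squares linear classifier stands in for whatever linear classifier is used in practice. The function names and the synthetic data are hypothetical and do not come from the paper. The CAV is the unit normal of the separating hyperplane between concept and random activations, and the score is the fraction of class examples whose gradient has a positive component along the CAV.

```python
import numpy as np


def compute_cav(concept_acts, random_acts):
    """Fit a least-squares linear separator and return its unit normal (the CAV).

    Both inputs have shape (num_examples, num_features). Least squares is a
    simplification used here for brevity, not necessarily the paper's choice.
    """
    features = np.vstack([concept_acts, random_acts])
    # Append a constant column so the separator can have a bias term.
    features = np.hstack([features, np.ones((features.shape[0], 1))])
    labels = np.concatenate(
        [np.ones(len(concept_acts)), -np.ones(len(random_acts))]
    )
    weights, *_ = np.linalg.lstsq(features, labels, rcond=None)
    cav = weights[:-1]  # drop the bias weight
    return cav / np.linalg.norm(cav)


def tcav_score(class_logit_grads, cav):
    """Fraction of examples whose directional derivative along the CAV is positive."""
    directional_derivatives = class_logit_grads @ cav
    return float(np.mean(directional_derivatives > 0))


# Synthetic stand-ins for flattened layer activations and gradients (assumption).
rng = np.random.default_rng(0)
concept = rng.normal(loc=1.0, size=(50, 8))
random_examples = rng.normal(loc=0.0, size=(50, 8))
gradients = rng.normal(loc=0.2, size=(100, 8))

cav = compute_cav(concept, random_examples)
print("TCAV-style score:", tcav_score(gradients, cav))
```

For a video model, the same computation would presumably be applied to activations that carry a temporal dimension, flattened per clip before fitting the separator.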

Paper Structure (11 sections, 10 figures)

Figures (10)

  • Figure 1: Grad-CAM outputs with respect to the class playing tennis for Video Swin Transformer model. (a): Grad-CAM correctly highlights the regions of movement for tennis rackets. (b): Grad-CAM focuses on the tennis court in the background and ignores the players in the frame.
  • Figure 2: TCAV process in Image Classification. Image taken from Kim et al. (2018). Best viewed zoomed.
  • Figure 3: Video Swin Transformer block diagram. Red arrows mark the 3 layers whose activations we study while testing CAVs. Best viewed zoomed.
  • Figure 4: YOLO-v7 detections on a frame of a video from the playing tennis class. Best viewed zoomed.
  • Figure 5: Exemplars of Spatial Concepts: person playing tennis, tennis racket, and sports ball generated from Kinetics-400 dataset.
  • ...and 5 more figures