STAA: Spatio-Temporal Attention Attribution for Real-Time Interpreting Transformer-based Video Models

Zerui Wang, Yan Liu

TL;DR

STAA (Spatio-Temporal Attention Attribution) is an XAI method for interpreting video Transformer models. It derives spatial and temporal explanations simultaneously from the model's attention values, at less than 3% of the computational cost of traditional XAI methods.

Abstract

Transformer-based models have achieved state-of-the-art performance in various computer vision tasks, including image and video analysis. However, the Transformer's complex architecture and black-box nature pose challenges for explainability, a crucial aspect for real-world applications and scientific inquiry. Current Explainable AI (XAI) methods provide only one-dimensional feature importance, either a spatial or a temporal explanation, and do so at significant computational cost. This paper introduces STAA (Spatio-Temporal Attention Attribution), an XAI method for interpreting video Transformer models. Unlike traditional methods, which separately apply image XAI techniques for spatial features or segment contribution analysis for temporal aspects, STAA offers both spatial and temporal information simultaneously from attention values in Transformers. The study uses the Kinetics-400 dataset, a benchmark collection of 400 human action classes used for action recognition research. We introduce metrics to quantify explanations. We also apply optimization to enhance STAA's raw output: by implementing dynamic thresholding and attention focusing mechanisms, we improve the signal-to-noise ratio in our explanations, resulting in more precise visualizations and better evaluation results. In terms of computational overhead, our method requires less than 3% of the computational resources of traditional XAI methods, making it suitable for real-time video XAI analysis applications. STAA contributes to the growing field of XAI by offering a method for researchers and practitioners to analyze Transformer models.
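The abstract does not spell out how attention values become explanations, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes a single attention tensor of shape (heads, tokens, tokens) with a classification token at index 0 followed by T x H x W patch tokens. The function names, the head averaging, and the mean-plus-k-standard-deviations threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def attention_attribution(attn, num_frames, grid_h, grid_w):
    """Split one attention tensor into spatial and temporal importance.

    attn: array of shape (heads, 1 + T*H*W, 1 + T*H*W); token 0 is assumed
    to be a classification token (an assumption of this sketch).
    """
    # Attention from the classification token to every patch, averaged over heads.
    cls_to_patches = attn[:, 0, 1:].mean(axis=0)
    # Spatial explanation: one (H, W) importance map per frame.
    spatial = cls_to_patches.reshape(num_frames, grid_h, grid_w)
    # Temporal explanation: one importance score per frame.
    temporal = spatial.mean(axis=(1, 2))
    return spatial, temporal

def dynamic_threshold(spatial, k=1.0):
    """Zero out values below mean + k * std (a hypothetical thresholding rule)."""
    thresh = spatial.mean() + k * spatial.std()
    return np.where(spatial >= thresh, spatial, 0.0)

# Toy input: random logits turned into row-stochastic attention via softmax.
rng = np.random.default_rng(0)
T, H, W, heads = 4, 3, 3, 2
n_tokens = 1 + T * H * W
logits = rng.normal(size=(heads, n_tokens, n_tokens))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

spatial, temporal = attention_attribution(attn, T, H, W)
focused = dynamic_threshold(spatial)
```

Because both outputs come from the same attention tensor produced by one forward pass, no extra perturbation passes over the model are needed in this sketch, which is consistent with the low overhead the abstract reports.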

Paper Structure

This paper contains 37 sections, 10 equations, 6 figures, 1 table, 3 algorithms.

Figures (6)

  • Figure 1: Flowchart of SHAP-based temporal feature attribution method
  • Figure 2: Flowchart of the LIME-based spatial feature attribution method
  • Figure 3: Overview of Spatio-temporal Attention Attribution (STAA) method for transformer-based video models
  • Figure 4: Example of attention visualization: a heatmap overlaid on a video frame, where blue indicates regions of higher importance and red indicates lower importance.
  • Figure 5: Cloud-based architecture for real-time video XAI
  • ...and 1 more figures