Understanding Video Transformers via Universal Concept Discovery

Matthew Kowal, Achal Dave, Rares Ambrus, Adrien Gaidon, Konstantinos G. Derpanis, Pavel Tokmakov

TL;DR

This work addresses the interpretability gap of video transformers by introducing VTCD, a framework that discovers spatiotemporal concepts in transformer representations without supervision. It decomposes layerwise features into tubelet proposals, clusters them with Convex Non-negative Matrix Factorization to form human-interpretable concepts, and ranks their importance to the model's predictions with CRIS. The study reveals universal, cross-model concepts (Rosetta concepts) shared by supervised and self-supervised video models, with early layers encoding spatiotemporal position, middle layers tracking objects, and later layers reasoning about occlusions. VTCD also has practical uses, such as targeted pruning of attention heads and improved video object segmentation, which suggests it could guide the design and optimization of video representation learning.

Abstract

This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. In comparison, video models deal with the added temporal dimension, which increases complexity and makes it harder to identify dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for unsupervised identification of units of video transformer representations (concepts) and for ranking their importance to the output of a model. The resulting concepts are highly interpretable, revealing spatiotemporal reasoning mechanisms and object-centric representations in unstructured video models. Performing this analysis jointly over a diverse set of supervised and self-supervised representations, we discover that some of these mechanisms are universal in video transformers. Finally, we show that VTCD can be used for fine-grained action recognition and video object segmentation.

Paper Structure (20 sections, 8 equations, 9 figures, 4 tables)


Figures (9)

  • Figure 1: Heatmap predictions of the TCOW model [van2023tracking] for tracking through occlusions (top), together with concepts discovered by our VTCD (bottom). We can see that the model encodes positional information in early layers, identifies containers and collision events in mid-layers and tracks through occlusions in late layers. Only one video is shown, but the discovered concepts are shared between many dataset samples (see https://youtu.be/TVfSyDQAb3I for full results).
  • Figure 2: Video Transformer Concept Discovery (VTCD) takes a dataset of videos, $\textbf{X}$, as input and passes them to a model, $f_{[1,l]}$ (shown in yellow). The set of video features, $\textbf{Z}$, is then parsed into spatiotemporal tubelet proposals, $\textbf{T}$ (shown in red), via SLIC clustering in the feature space. Finally, tubelets are clustered across the videos to discover high-level units of network representation, the concepts $\textbf{C}$ (right).
  • Figure 3: A visual representation of concept masking for a single concept. Given a video $\mathbf{x}_i$ and a concept, $c_l$, we mask the tokens of the intermediate representation $\mathbf{z}_i = f_{[1,l]}(\mathbf{x}_i)$ with the concept's binary support masks, $\textbf{B}_{c_l}$, to obtain the perturbed prediction, $\hat{y}_i$.
  • Figure 4: Attribution curves for every layer of TCOW trained on Kubric (top) and VideoMAE trained on SSv2 (bottom). We remove concepts from most-to-least (left) or least-to-most important (right). CRIS produces better concept importance than methods based on single concept occlusions (Occ) or gradients (IG).
  • Figure 5: The top-3 most important concepts for the TCOW model trained on Kubric (left) and VideoMAE trained on SSv2 for the target class "dropping something into something" (right). Two videos are shown for each concept, and the query object is denoted with a green border in Kubric. For TCOW, the $1^{st}$ and $2^{nd}$ most important concepts (top-left, middle-left) track multiple objects, including the target and the distractors. For VideoMAE, the top concept (top-right) captures the object and the dropping event (i.e. hand, object and container), while the $2^{nd}$ most important concept (middle-right) captures solely the container. Interestingly, for both models and tasks, the third most important concept (bottom) is a temporally invariant tubelet. See Section \ref{sec:RosettaExp} for further discussion (and https://youtu.be/AsvTkcdvdC4 for full results).
  • ...and 4 more figures
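
The tubelet and masking steps described in the captions of Figures 2 and 3 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the tokens of one video are already flattened into an `(N, D)` array, that a tubelet assignment per token is given (the paper obtains tubelets via SLIC clustering, which is not reproduced here), and that masking means setting the masked tokens to zero. The function names `pool_tubelets` and `mask_concept` are hypothetical.

```python
import numpy as np

def pool_tubelets(z, tubelet_ids):
    """Average the token features inside each tubelet.

    z: (N, D) token features of one video (assumed layout).
    tubelet_ids: (N,) integer labels in [0, K), every label used at least once.
    Returns a (K, D) array with one mean feature per tubelet.
    """
    num_tubelets = int(tubelet_ids.max()) + 1
    sums = np.zeros((num_tubelets, z.shape[1]), dtype=z.dtype)
    np.add.at(sums, tubelet_ids, z)  # accumulate features per tubelet
    counts = np.bincount(tubelet_ids, minlength=num_tubelets)
    return sums / counts[:, None]

def mask_concept(z, support_mask):
    """Zero out the tokens covered by a concept's binary support mask.

    Zeroing is an assumption made for illustration; the captions only
    state that tokens are masked.
    """
    keep = 1.0 - support_mask.astype(z.dtype)
    return z * keep[:, None]

# Toy example: 6 tokens with 2 feature dimensions, grouped into 3 tubelets.
z = np.arange(12, dtype=np.float64).reshape(6, 2)
tubelet_ids = np.array([0, 0, 1, 1, 2, 2])
tubelets = pool_tubelets(z, tubelet_ids)

# Pretend the concept consists of tubelet 1 only, then mask its tokens.
support_mask = tubelet_ids == 1
z_masked = mask_concept(z, support_mask)
```

In the full method, the masked representation would be passed through the remaining layers of the model to obtain the perturbed prediction $\hat{y}_i$, and the pooled tubelet features from many videos would be clustered into concepts.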