Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification

Patrick Knab; Sascha Marton; Philipp J. Schubert; Drago Guggiana; Christian Bartelt

TL;DR

MoTIF is a transformer-based concept architecture that operates on sequences of temporally grounded concept activations. It uses per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions.

Abstract

Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging because concepts are difficult both to extract and to model over time. In this paper, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations. It employs per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is an agentic concept discovery module that automatically extracts object- and action-centric textual concepts from videos, yielding temporally expressive concept sets without manual supervision. Across multiple video benchmarks, this combination substantially narrows the performance gap between interpretable and black-box video models while maintaining faithful and temporally grounded concept explanations. Code is available at https://github.com/patrick-knab/MoTIF.
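
To make the idea of per-concept temporal self-attention concrete, the following is a minimal sketch and not the released MoTIF implementation. It assumes each frame is already described by a vector of concept scores, and it uses hypothetical scalar query/key/value weights (`wq`, `wk`, `wv`) in place of learned projections. The point it illustrates is that attention runs along the time axis separately for each concept, so the attention weights of one concept never mix with those of another and can be read as "which time steps mattered for this concept".

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def per_concept_temporal_attention(activations, wq=1.0, wk=1.0, wv=1.0):
    """Self-attention over time, applied independently to each concept.

    activations: list of T frames, each a list of C concept scores.
    wq, wk, wv: illustrative scalar projections (an assumption of this
                sketch; a real model would learn richer projections).
    Returns (outputs, attn) with outputs[t][c] and attn[c][t][s].
    """
    num_frames = len(activations)
    num_concepts = len(activations[0])
    outputs = [[0.0] * num_concepts for _ in range(num_frames)]
    attn = []
    for c in range(num_concepts):
        # Time series of a single concept; other concepts are never read here.
        series = [activations[t][c] for t in range(num_frames)]
        weights_c = []
        for t in range(num_frames):
            scores = [(wq * series[t]) * (wk * series[s]) for s in range(num_frames)]
            w = softmax(scores)
            weights_c.append(w)
            outputs[t][c] = sum(w[s] * wv * series[s] for s in range(num_frames))
        attn.append(weights_c)
    return outputs, attn


def classify(outputs, class_weights):
    """Mean-pool each concept over time, then apply a linear layer.

    class_weights: one list of C weights per class (hypothetical values).
    """
    num_frames = len(outputs)
    num_concepts = len(outputs[0])
    pooled = [sum(outputs[t][c] for t in range(num_frames)) / num_frames
              for c in range(num_concepts)]
    return [sum(w[c] * pooled[c] for c in range(num_concepts)) for w in class_weights]
```

In this sketch, `attn[c][t]` is a distribution over time steps for concept `c`, and the linear layer in `classify` keeps the final logits attributable to individual concepts. The mean pooling and the linear head are illustrative choices, not details taken from the paper.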
