Leveraging Temporal Contextualization for Video Action Recognition

Minji Kim; Dongyoon Han; Taekyung Kim; Bohyung Han

TL;DR

TC-CLIP tackles temporal modeling in video understanding with two components: Temporal Contextualization (TC), which summarizes frame-level tokens into a compact set of context tokens, and Video-conditional Prompting (VP), which injects video context into text prompts. The approach enables global spatio-temporal interactions within a CLIP-based framework and is trained end-to-end with a contrastive video-text objective. Experiments in zero-shot, few-shot, base-to-novel, and fully supervised settings on multiple benchmarks show state-of-the-art performance, and in-depth ablations support the design of TC and VP. The work offers practical gains for open-vocabulary video recognition, and the source code is publicly available for replication.

Abstract

We propose a novel framework for video understanding, called Temporally Contextualized CLIP (TC-CLIP), which leverages essential temporal information through global interactions in a spatio-temporal domain within a video. To be specific, we introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos, which 1) extracts core information from each frame, 2) connects relevant information across frames for the summarization into context tokens, and 3) leverages the context tokens for feature encoding. Furthermore, the Video-conditional Prompting (VP) module processes context tokens to generate informative prompts in the text modality. Extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition validate the effectiveness of our model. Ablation studies for TC and VP support our design choices. Our project page with the source code is available at https://github.com/naver-ai/tc-clip
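The three TC steps listed in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the per-token `scores` used to pick informative tokens, the nearest-center averaging used for summarization, and all function names are assumptions made for illustration, and the attention step omits learned projections and multiple heads.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def select_seed_tokens(frame_tokens, scores, k):
    """Step 1: keep the k highest-scoring patch tokens of one frame.
    The scores are an assumed informativeness measure."""
    order = sorted(range(len(frame_tokens)), key=lambda i: scores[i], reverse=True)
    return [frame_tokens[i] for i in order[:k]]

def summarize(seed_tokens, num_context):
    """Step 2: group seeds from all frames and average each group.
    Stand-in grouping: the first num_context seeds act as centers and
    every seed joins its most similar center (not the paper's procedure)."""
    centers = seed_tokens[:num_context]
    if not centers:
        return []
    groups = [[] for _ in centers]
    for tok in seed_tokens:
        best = max(range(len(centers)), key=lambda c: cosine(tok, centers[c]))
        groups[best].append(tok)
    # Empty groups are skipped, so fewer than num_context tokens may be returned.
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups if g]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    logits = [dot(query, key) / scale for key in keys]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def tc_layer(video_tokens, video_scores, k, num_context):
    """Step 3: each frame attends to its own tokens plus the shared context tokens."""
    seeds = [tok
             for frame, scores in zip(video_tokens, video_scores)
             for tok in select_seed_tokens(frame, scores, k)]
    context = summarize(seeds, num_context)
    encoded = []
    for frame in video_tokens:
        keys_values = frame + context  # context tokens extend the key-value set
        encoded.append([attend(q, keys_values, keys_values) for q in frame])
    return encoded, context
```

Because the context tokens are shared by every frame, a query in one frame can draw on summarized content from all other frames while the key-value set grows only by `num_context` entries.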

Paper Structure (21 sections, 11 equations, 12 figures, 18 tables)

Introduction
Proposed Method
Preliminary
Motivation
Temporal Contextualization (TC)
Video-conditional Prompting (VP)
Training Objective
Experiments
Quantitative Comparison
Analysis and Discussion
Related Work
Conclusion
Fine-tuning with the Kinetics-400 Pretrained Model
More Ablation Study on VP
Scalability with ViT-L/14
...and 6 more sections

Figures (12)

Figure 1: Comparison of attention maps between various temporal modeling approaches. Both (a) and (b) fail to recognize actions in the latter frames, whereas (c) exhibits weak discriminability due to sparse attention on the background. In contrast, (d) our method focuses on informative regions across all frames, leading to accurate action recognition.
Figure 2: Temporal information learning methods. Prior works consider temporal cues during the encoding process via (a) cross-frame attention (X-CLIP, Vita-CLIP) with [CLS] token interactions or (b) temporal window expansion (Open-VCLIP) by adding adjacent frame tokens to key-value pairs. However, the former lacks patch-level interactions, while the latter limits the range of temporal interactions. (c) Joint space-time attention allows full interactions across all tokens, but it is costly and suboptimal in practice (see Fig. 3). (d) Unlike prior approaches, our method efficiently aggregates pivotal tokens from a broader temporal range and integrates them into key-value pairs for enhanced temporal modeling.
Figure 3: Pitfall of joint space-time attention. (a) Extending CLIP's temporal sequence length degrades attention quality, presumably because it was not trained on such long sequences. (b) We compare the action recognition performance in the few-shot setting on diverse datasets. All existing methods fall behind our method.
Figure 4: Overview of Temporal Contextualization (TC). We first collect informative tokens from each frame and then aggregate relevant seed tokens to obtain context tokens. They are used as key-value pairs for the self-attention in the next layer.
Figure 5: Video-conditional Prompting (VP) module. Video information from the context tokens is injected into the text prompt vectors using a cross-attention mechanism, generating instance-level prompts that make up for the lack of textual semantics.
...and 7 more figures
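
The cross-attention step described in the Figure 5 caption can be sketched as follows. This is an illustrative assumption, not the authors' module: the residual update, the `weight` parameter, and the function names are hypothetical, and learned projections are omitted.

```python
import math

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query mixes the values."""
    dim = len(values[0])
    outputs = []
    for q in queries:
        scale = math.sqrt(len(q))
        logits = [sum(a * b for a, b in zip(q, key)) / scale for key in keys]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                        for d in range(dim)])
    return outputs

def video_conditional_prompt(prompt_vectors, context_tokens, weight=1.0):
    """Text prompt vectors act as queries; video context tokens act as keys and values.
    The residual addition scaled by `weight` is an assumed design choice."""
    attended = cross_attention(prompt_vectors, context_tokens, context_tokens)
    return [[p + weight * a for p, a in zip(prompt, update)]
            for prompt, update in zip(prompt_vectors, attended)]
```

Since the context tokens differ for every video, the resulting prompt vectors are specific to each instance rather than fixed per class.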
