Leveraging Temporal Contextualization for Video Action Recognition
Minji Kim, Dongyoon Han, Taekyung Kim, Bohyung Han
TL;DR
TC-CLIP addresses temporal modeling in video understanding with two components: Temporal Contextualization (TC), which summarizes frame-level tokens into a compact set of context tokens, and Video-conditional Prompting (VP), which injects video context into the text prompts. Together they enable global spatio-temporal interactions within a CLIP-based framework that is trained end-to-end with a contrastive video-text objective. Experiments in zero-shot, few-shot, base-to-novel, and fully-supervised settings on multiple benchmarks report state-of-the-art performance, and ablation studies support the design of TC and VP. The work targets open-vocabulary video recognition, and the source code is publicly available.
Abstract
We propose a novel framework for video understanding, called Temporally Contextualized CLIP (TC-CLIP), which leverages essential temporal information through global spatio-temporal interactions within a video. Specifically, we introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos, which 1) extracts core information from each frame, 2) connects relevant information across frames and summarizes it into context tokens, and 3) leverages the context tokens for feature encoding. Furthermore, the Video-conditional Prompting (VP) module processes the context tokens to generate informative prompts in the text modality. Extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition validate the effectiveness of our model, and ablation studies for TC and VP support our design choices. Our project page with the source code is available at https://github.com/naver-ai/tc-clip
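The three TC steps can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation; the repository linked above is the reference. The abstract does not say how core information is identified or how tokens are summarized, so the sketch makes three loud assumptions: per-token importance scores are given as input, summarization is done by greedy merging of the most similar tokens, and attention uses no learned projections. All function names are hypothetical.

```python
import math

def select_seed_tokens(frame_tokens, scores, k):
    """Step 1 (assumed form): keep the k highest-scoring tokens of one frame."""
    ranked = sorted(range(len(frame_tokens)), key=lambda i: scores[i], reverse=True)
    return [frame_tokens[i] for i in sorted(ranked[:k])]  # preserve original order

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def summarize(seed_tokens, num_context):
    """Step 2 (assumed form): greedily merge the most similar pair of tokens
    until num_context tokens remain. Each cluster keeps a running mean."""
    clusters = [(list(t), 1) for t in seed_tokens]  # (mean vector, member count)
    while len(clusters) > num_context:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = max(pairs, key=lambda p: cosine(clusters[p[0]][0], clusters[p[1]][0]))
        (a, na), (b, nb) = clusters[i], clusters[j]
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(a, b)]
        clusters[i] = (merged, na + nb)
        del clusters[j]  # safe because i < j
    return [mean for mean, _ in clusters]

def attend(queries, keys_values):
    """Step 3 (simplified): single-head softmax attention without projections."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        logits = [sum(x * y for x, y in zip(q, kv)) / math.sqrt(dim) for kv in keys_values]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        outputs.append([
            sum(w * kv[c] for w, kv in zip(weights, keys_values)) / total
            for c in range(dim)
        ])
    return outputs

def temporal_contextualization(video, scores, k, num_context):
    """video: list of frames, each a list of token vectors."""
    seeds = [tok for frame, s in zip(video, scores)
             for tok in select_seed_tokens(frame, s, k)]
    context = summarize(seeds, num_context)
    # Each frame attends to its own tokens plus the video-level context tokens.
    encoded = [attend(frame, frame + context) for frame in video]
    return encoded, context

# Toy input: 2 frames, 3 tokens per frame, 2-dimensional features.
video = [[[1, 0], [0, 1], [1, 1]],
         [[1, 0.1], [0, 0.9], [0.5, 0.5]]]
scores = [[0.9, 0.8, 0.1], [0.7, 0.6, 0.2]]  # hypothetical importance scores
out, context = temporal_contextualization(video, scores, k=2, num_context=2)
```

In this toy run, the two selected tokens of each frame are paired with their nearest counterpart in the other frame, so the two context tokens each summarize one cross-frame pair. The point of the design is that per-frame attention gains access to video-level information at the cost of only `num_context` extra keys and values, rather than attending over every token of every frame.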
