CAST: Cross-Attention in Space and Time for Video Action Recognition
Dongho Lee, Jongseo Lee, Jinwoo Choi
TL;DR
This paper tackles balanced spatio-temporal understanding in video action recognition. It proposes CAST, a two-stream, RGB-only architecture that models spatial and temporal cues separately with frozen expert models (e.g., CLIP for space and VideoMAE for time) and lets them exchange information through a Bottleneck Cross-Attention in Space and Time (B-CAST) module. B-CAST integrates Temporal-to-Spatial and Spatial-to-Temporal cross-attention within adapters placed in a ViT-like backbone, so the two streams can refine each other's predictions without fully fine-tuning the large models. Experiments on EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400 show that CAST achieves more balanced performance (measured by the harmonic mean across tasks) than the individual experts and many baselines, and ablations validate the roles of bi-directional cross-attention, window shape, and bottleneck adapters. The results hold up across datasets with different characteristics and suggest that cross-stream attention can unify spatial and temporal representations in RGB-only video understanding, with practical implications for more robust action recognition systems.
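To see why a harmonic mean is a reasonable measure of balance, the short sketch below compares it with the arithmetic mean. The accuracy values are hypothetical numbers chosen for illustration, not results from the paper, and `balance_score` is an illustrative helper name.

```python
from statistics import harmonic_mean, mean

# Hypothetical top-1 accuracies (percent) on three benchmarks.
# These are illustrative values only, not numbers reported in the paper.
balanced = [70.0, 70.0, 70.0]   # a model that is equally good everywhere
lopsided = [90.0, 90.0, 30.0]   # a model that is strong on two, weak on one


def balance_score(accuracies):
    """Harmonic mean of per-benchmark accuracies (hypothetical helper)."""
    return harmonic_mean(accuracies)


for name, scores in [("balanced", balanced), ("lopsided", lopsided)]:
    print(name, "arithmetic:", mean(scores), "harmonic:", balance_score(scores))
```

Both hypothetical models have an arithmetic mean of 70, but the harmonic mean of the lopsided model drops to 54 because a single weak benchmark dominates the score. A metric with this property rewards models that do well on every benchmark rather than on a subset.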
Abstract
Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.
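The idea of exchanging information between two streams through a bottleneck can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-head attention, shares one token dimension across both streams, draws random weights, and omits the separate query/key/value projections, window shapes, normalization, and nonlinearities a real adapter would have. All function and variable names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: queries from one stream, keys/values from the other."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    context_t = [list(col) for col in zip(*context)]
    scores = matmul(queries, context_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, context)


def bottleneck_cross_attention(own, other, w_down_own, w_down_other, w_up):
    """Down-project both streams, cross-attend in the small space, up-project, add residual."""
    q = matmul(own, w_down_own)        # this stream's tokens, reduced width
    kv = matmul(other, w_down_other)   # the other stream's tokens, reduced width
    update = matmul(cross_attend(q, kv), w_up)
    return [[o + u for o, u in zip(o_row, u_row)] for o_row, u_row in zip(own, update)]


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, bottleneck = 8, 2                   # illustrative sizes only
spatial = random_matrix(4, dim, rng)     # 4 tokens from the spatial stream
temporal = random_matrix(6, dim, rng)    # 6 tokens from the temporal stream

# Temporal-to-Spatial direction: spatial tokens query the temporal stream.
spatial_out = bottleneck_cross_attention(
    spatial, temporal,
    random_matrix(dim, bottleneck, rng), random_matrix(dim, bottleneck, rng),
    random_matrix(bottleneck, dim, rng),
)
# Spatial-to-Temporal direction: temporal tokens query the spatial stream.
temporal_out = bottleneck_cross_attention(
    temporal, spatial,
    random_matrix(dim, bottleneck, rng), random_matrix(dim, bottleneck, rng),
    random_matrix(bottleneck, dim, rng),
)
print(len(spatial_out), len(spatial_out[0]), len(temporal_out), len(temporal_out[0]))
```

Each stream keeps its own token count and width, so the update can be added back as a residual. Because attention runs in the reduced width, the extra cost and parameter count stay small relative to the backbone, which is the usual motivation for a bottleneck design.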
