CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
Jongseo Lee, Joohyun Chang, Dongho Lee, Jinwoo Choi
TL;DR
This work tackles the challenge of balanced spatio-temporal understanding in video action recognition and seeks robust audio-visual integration without heavy pretraining. It introduces CAST, a two-stream spatial and temporal transformer whose Bottleneck Cross-Attention lets the spatial and temporal streams exchange information using only RGB input. It then extends CAST to CAVA (audio–visual fusion) and CA^2ST (three-expert fusion: audio, space, and time) to achieve holistic video understanding through cross-attention across all modalities. Across diverse datasets, CAST demonstrates balanced performance between spatial and temporal tasks, while CAVA and CA^2ST achieve strong audio-visual and cross-modal results with favorable efficiency, supported by thorough ablations and robustness analyses. The approach offers a modular, adapter-based framework that can serve as a drop-in backbone for spatio-temporal and multi-modal video understanding.
Abstract
We propose Cross-Attention in Audio, Space, and Time (CA^2ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables the spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate CAST on benchmarks with different characteristics (EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400), where it consistently shows balanced performance. We also validate CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS. The favorable performance of CAVA across these datasets demonstrates effective information exchange among multiple experts within the B-CA module. In summary, CA^2ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.
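To make the idea of a bottleneck cross-attention exchange between two experts concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes a single attention head, a shared key/value projection, a residual connection, and illustrative sizes; all function names, parameter names, and dimensions are hypothetical and do not come from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian matrix, standing in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def make_params(d, r, rng):
    """Hypothetical parameters: two down-projections (d -> r) and one up-projection (r -> d)."""
    return (random_matrix(d, r, rng), random_matrix(d, r, rng), random_matrix(r, d, rng))


def bottleneck_cross_attention(queries, context, w_down_q, w_down_c, w_up):
    """One expert's tokens (`queries`) attend to another expert's tokens (`context`).

    Attention runs in a reduced width r (the assumed "bottleneck"); the result
    is projected back to width d and added to the queries as a residual.
    """
    q = matmul(queries, w_down_q)    # (n_q, r)
    kv = matmul(context, w_down_c)   # (n_c, r); keys and values share one projection here
    r = len(q[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(r) for kj in kv] for qi in q]
    weights = [softmax(row) for row in scores]   # (n_q, n_c), each row sums to 1
    attended = matmul(weights, kv)               # (n_q, r)
    update = matmul(attended, w_up)              # (n_q, d)
    out = [[x + u for x, u in zip(xr, ur)] for xr, ur in zip(queries, update)]
    return out, weights


def exchange(spatial, temporal, p_s, p_t):
    """Bidirectional exchange: each expert queries the other's original tokens."""
    new_spatial, w_s = bottleneck_cross_attention(spatial, temporal, *p_s)
    new_temporal, w_t = bottleneck_cross_attention(temporal, spatial, *p_t)
    return new_spatial, new_temporal, w_s, w_t


rng = random.Random(0)
d, r = 8, 2                                  # token width and bottleneck width (illustrative)
spatial = random_matrix(4, d, rng, 1.0)      # 4 spatial-expert tokens
temporal = random_matrix(6, d, rng, 1.0)     # 6 temporal-expert tokens
p_s, p_t = make_params(d, r, rng), make_params(d, r, rng)
new_spatial, new_temporal, w_s, w_t = exchange(spatial, temporal, p_s, p_t)
```

In this sketch each expert keeps its own token count and width after the exchange, so the step can be repeated layer by layer. Under the same assumptions, an audio expert could be added by calling `bottleneck_cross_attention` with audio tokens as the context.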
