CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition

Jongseo Lee, Joohyun Chang, Dongho Lee, Jinwoo Choi

TL;DR

This work tackles the challenge of balanced spatio-temporal understanding in video action recognition and seeks robust audio-visual integration without heavy pretraining. It introduces CAST, a two-stream Spatial and Temporal transformer with Bottleneck Cross-Attention that enables cross-exchange between space and time using only RGB input. It then extends CAST to CAVA (audio–visual fusion) and CA^2ST (three-expert fusion: audio, space, and time) to achieve holistic video understanding through cross-attention across all modalities. Across diverse datasets, CAST demonstrates balanced performance between spatial and temporal tasks, while CAVA and CA^2ST achieve strong audio-visual and cross-modal results with favorable efficiency, supported by thorough ablations and robustness analyses. The approach offers a modular, adapter-based framework that can serve as a drop-in backbone for spatio-temporal and multi-modal video understanding.

Abstract

We propose Cross-Attention in Audio, Space, and Time (CA^2ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate CAST on benchmarks with different characteristics (EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400), where it consistently shows balanced performance. We also validate CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS. The favorable performance of CAVA across these datasets demonstrates effective information exchange among multiple experts within the B-CA module. In summary, CA^2ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.

Paper Structure

This paper contains 68 sections, 10 equations, 9 figures, 22 tables.

Figures (9)

  • Figure 1: Importance of balanced spatio-temporal and audio-visual understanding. (a) A model without fine-grained spatial understanding fails to predict Put down cheese because of subtle appearance differences between the objects. Conversely, a model that lacks sufficient temporal context may predict the wrong action. For example, a model without temporal context understanding fails to predict Take out sauce, since it is hard to distinguish from Put in without knowing the order of events. (b) If a model lacks audio cues, it may struggle to distinguish between visually similar musical instruments, such as contrabass versus violin. Conversely, when visual cues are missing and only audio cues are available, distinguishing actions such as wash hands and washing dishes becomes challenging because they sound similar.
  • Figure 2: High-level illustration of CA$^2$ST. CA$^2$ST employs spatial, temporal, and audio expert models. The three experts exchange information through interactions and teach each other. In the early stage, the experts might be able to predict only partial information because they lack a comprehensive understanding. After multiple iterations of information exchange among the experts, the proposed method can collectively predict the correct action: Washing hands in the restroom.
  • Figure 3: Overview of CA$^2$ST. (a) CAST employs frozen spatial and temporal experts, connected through bottleneck cross-attention (B-CA) modules that facilitate information exchange between the two paths (S&T). CAVA employs frozen audio and visual (spatial) experts, which exchange information via B-CA modules (A&S). CA$^2$ST extends this architecture by incorporating three paths (spatial, temporal, and audio) connected through three B-CA modules (A&S, S&T, A&T). For better adaptation, we learn only a small number of parameters, those of the B-CA modules and adapters, while freezing all other parameters. (b) For simplicity, we illustrate only the S&T B-CA module, as the other types of B-CA module differ only in the experts they connect. The S&T B-CA module enables temporal-to-spatial (T2S) and spatial-to-temporal (S2T) cross-attention, facilitating a balanced understanding of spatio-temporal features. To enable efficient and effective learning, we incorporate cross-attention into the bottleneck adapter. We employ a separate position embedding for each expert. (c) In T2S, the model attends along the temporal axis only. In contrast, in S2T, the model attends along the spatial axes only.
  • Figure 4: Improvements of CAST over each expert on EK100 noun classes. (a) Improvement over CLIP. CAST outperforms CLIP for every super-category except meat and substitute. (b) Improvement over VideoMAE. CAST outperforms VideoMAE for every super-category except furniture and prepared food.
  • Figure 5: Qualitative examples from EK100 comparing CLIP, VideoMAE, and the proposed CAST. Each expert model predicts more accurately on the task matching its expertise but is weaker on the other task. CAST, in contrast, predicts correctly on both tasks, demonstrating the effectiveness of the proposed B-CA module.
  • ...and 4 more figures
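
The Figure 3 caption describes the S&T B-CA module as cross-attention placed inside a bottleneck adapter, with T2S attending along the temporal axis only and S2T along the spatial axes only. The following NumPy sketch illustrates that axis restriction. It is not the authors' implementation: the class name `BottleneckCrossAttention`, the single attention head, the shared key/value projection, the zero-initialized up-projection, and the assumption that both experts share the same (T, N, D) token grid are all simplifications made here for illustration. Position embeddings and normalization layers are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along one axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(q, k, v):
    # Single-head scaled dot-product attention.
    # q, k, v have shape (batch, length, d); attention runs over `length`.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


class BottleneckCrossAttention:
    """Hypothetical, simplified bottleneck cross-attention adapter.

    Tokens are laid out as (T, N, D): T frames, N patches per frame,
    D channels. axis="time" attends across frames at a fixed patch
    location (T2S-style); axis="space" attends across patches within
    a fixed frame (S2T-style).
    """

    def __init__(self, dim, bottleneck, axis, rng):
        if axis not in ("time", "space"):
            raise ValueError("axis must be 'time' or 'space'")
        self.axis = axis
        # Down-projections into the narrow bottleneck width.
        self.w_down_q = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.w_down_kv = rng.normal(0.0, 0.02, (dim, bottleneck))
        # Assumption: zero-initialized up-projection, so the module
        # starts out as an identity mapping on the query expert.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x_query, x_context):
        q = x_query @ self.w_down_q        # (T, N, d)
        kv = x_context @ self.w_down_kv    # (T, N, d)
        if self.axis == "time":
            # Treat each patch location as a batch entry: (N, T, d).
            q, kv = q.swapaxes(0, 1), kv.swapaxes(0, 1)
        out = cross_attention(q, kv, kv)
        if self.axis == "time":
            out = out.swapaxes(0, 1)       # back to (T, N, d)
        # Residual connection around the adapter.
        return x_query + out @ self.w_up


rng = np.random.default_rng(0)
T, N, D, d = 4, 6, 16, 8
x_spatial = rng.normal(size=(T, N, D))    # stand-in for spatial expert tokens
x_temporal = rng.normal(size=(T, N, D))   # stand-in for temporal expert tokens

t2s = BottleneckCrossAttention(D, d, axis="time", rng=rng)
s2t = BottleneckCrossAttention(D, d, axis="space", rng=rng)

new_spatial = t2s(x_spatial, x_temporal)   # spatial queries, temporal context
new_temporal = s2t(x_temporal, x_spatial)  # temporal queries, spatial context
```

In this sketch, only `w_down_q`, `w_down_kv`, and `w_up` would be trainable, which mirrors the caption's point that the experts stay frozen while a small number of adapter parameters are learned.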