CAST: Cross-Attention in Space and Time for Video Action Recognition

Dongho Lee, Jongseo Lee, Jinwoo Choi

TL;DR

This paper tackles the challenge of balanced spatio-temporal understanding in video action recognition by proposing CAST, a two-stream RGB-only architecture that models spatial and temporal cues separately via frozen experts (e.g., CLIP for space and VideoMAE for time) and lets them exchange information through a Bottleneck Cross-Attention in Space and Time (B-CAST) module. B-CAST integrates Temporal-to-Spatial and Spatial-to-Temporal cross-attention within adapters placed in a ViT-like backbone, allowing the two streams to refine each other's predictions without fully fine-tuning large models. Extensive experiments on EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400 show that CAST achieves more balanced performance (as measured by the harmonic mean of accuracies across tasks) than the individual experts and many baselines, with ablations validating the role of bi-directional cross-attention, window shapes, and bottleneck adapters. The approach performs consistently across datasets with different characteristics and highlights the potential of cross-stream attention to unify spatial and temporal representations in RGB-only video understanding, with practical implications for more robust action recognition systems.
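
To make the B-CAST idea concrete, the following is a minimal NumPy sketch of cross-attention inside a bottleneck adapter. It is an illustration under stated assumptions, not the authors' implementation: both experts are assumed to share the same (T, N, D) token grid, attention is single-head with no learned query/key/value projections, ReLU stands in for the adapter nonlinearity, and the function and parameter names (`bottleneck_cross_attention`, `w_down_q`, `w_down_c`, `w_up`) are hypothetical. The attention windows follow the Figure 3(c) description: T2S attends along the temporal axis only, and S2T attends along the spatial axes only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(q, kv):
    # Single-head attention; q: (..., Lq, d), kv: (..., Lk, d).
    # Keys and values are the same tensor (assumption: no learned projections).
    scores = q @ np.swapaxes(kv, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ kv


def bottleneck_cross_attention(x_query, x_context, w_down_q, w_down_c, w_up, mode):
    """Cross-attention inside a bottleneck adapter (illustrative sketch).

    x_query, x_context: (T, N, D) tokens from the two experts
                        (T frames, N patches per frame, D channels).
    w_down_q, w_down_c: (D, d) down-projections with d < D.
    w_up:               (d, D) up-projection.
    mode: "T2S" (attend over time at a fixed patch) or
          "S2T" (attend over patches within a fixed frame).
    """
    q = x_query @ w_down_q      # (T, N, d)
    c = x_context @ w_down_c    # (T, N, d)
    if mode == "T2S":
        # Put time on the sequence axis: one attention problem per patch.
        q_t, c_t = np.swapaxes(q, 0, 1), np.swapaxes(c, 0, 1)   # (N, T, d)
        out = np.swapaxes(cross_attention(q_t, c_t), 0, 1)      # (T, N, d)
    elif mode == "S2T":
        # Patches are already the sequence axis: one problem per frame.
        out = cross_attention(q, c)                             # (T, N, d)
    else:
        raise ValueError("mode must be 'T2S' or 'S2T'")
    out = np.maximum(out, 0.0)  # ReLU as a stand-in nonlinearity (assumption)
    return x_query + out @ w_up  # residual connection around the adapter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N, D, d = 8, 16, 32, 4
    spatial = rng.normal(size=(T, N, D))    # stand-in for spatial-expert tokens
    temporal = rng.normal(size=(T, N, D))   # stand-in for temporal-expert tokens
    w_q, w_c = rng.normal(size=(D, d)), rng.normal(size=(D, d))
    w_u = rng.normal(size=(d, D))
    refined = bottleneck_cross_attention(spatial, temporal, w_q, w_c, w_u, "T2S")
    print(refined.shape)
```

In this sketch the frozen expert features pass through unchanged via the residual path, and only the small projection matrices would be trained, which is the general motivation for placing cross-attention in a bottleneck.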

Abstract

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.

Paper Structure

This paper contains 62 sections, 8 equations, 15 figures, and 11 tables.

Figures (15)

  • Figure 1: The importance of spatio-temporal understanding. If a model lacks fine-grained spatial understanding, the model may predict an incorrect action. E.g., the model fails to predict Put down a cheese in (a) due to subtle appearance differences between the objects. On the other hand, if a model lacks temporal context understanding, the model may predict an incorrect action. E.g., the model fails to predict Take out a sauce in (b) due to the ambiguity of the action. Therefore, both spatial and temporal understanding are crucial in action recognition. Best viewed with zoom and color.
  • Figure 2: High-level illustration of the proposed method. In this work, we employ spatial and temporal expert models. The two experts exchange information with each other using cross-attention. Initially, the experts may predict incorrect actions due to the lack of information. For example, the temporal expert may predict reach out to something while the ground truth is Pick up a fork. Similarly, the spatial expert may predict utensil holder instead of fork in the shallower layers. However, after using cross-attention to exchange information multiple times, the proposed method can collectively predict the correct action Pick up a fork. Best viewed with zoom and color.
  • Figure 3: Overview of CAST. (a) CAST employs frozen spatial and temporal expert models. On top of the experts, we add a cross-attention module, B-CAST, to enable the exchange of information between the two experts. Additionally, we add adapters with a small number of learnable parameters to the experts for better adaptation. (b) The proposed B-CAST consists of temporal-to-spatial (T2S) and spatial-to-temporal (S2T) cross-attentions to allow for a better understanding of the spatio-temporal features in the video data. For efficient and effective learning, we incorporate cross-attention into the bottleneck adapter. We employ a separate position embedding for each expert. (c) We visualize T2S and S2T cross-attentions. Given a query, the model attends along the temporal axis only in T2S, while it attends along the spatial axes only in S2T.
  • Figure 4: Balanced spatio-temporal understanding performance. We visualize the action recognition accuracies of existing methods and the proposed method. (a) We show the Top-1 accuracies of ST-Adapter and VideoMAE on the EK100 verb and noun prediction tasks. (b) We show the Top-1 accuracies of AIM and BEVT on SSV2 and K400. (c) For each method, we show the harmonic mean of Top-1 accuracies on the EK100 noun, EK100 verb, SSV2, and K400. CAST shows a more balanced spatio-temporal understanding capability compared to the existing methods. Best viewed with zoom and color.
  • Figure 5: Improvements of CAST over each expert on EK100 noun classes. We show the super-category-wise weighted average F1 score improvement of CAST over each expert. (Left) Improvement over CLIP. CAST outperforms CLIP for every super-category except meat and substitute. (Right) Improvement over VideoMAE. CAST outperforms VideoMAE for every super-category except furniture and prepared food. Best viewed with zoom and color.
  • ...and 10 more figures
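
The harmonic mean used in Figure 4(c) rewards balance: it is pulled toward the lowest of the accuracies being averaged, so a method that is strong on some benchmarks and weak on others scores lower than one that is uniformly good. The short Python sketch below illustrates this. The accuracy values are made-up placeholders chosen so the arithmetic is easy to check, not results from the paper, and `harmonic_mean` is a helper written for this illustration (the standard library also provides `statistics.harmonic_mean`).

```python
def harmonic_mean(values):
    """Harmonic mean of strictly positive numbers: n / sum(1 / v)."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("all values must be strictly positive")
    return len(values) / sum(1.0 / v for v in values)


def arithmetic_mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical Top-1 accuracies (%) on four tasks; NOT numbers from the paper.
balanced = [60.0, 60.0, 60.0, 60.0]      # uniformly moderate on every task
unbalanced = [90.0, 90.0, 30.0, 30.0]    # strong on two tasks, weak on two

if __name__ == "__main__":
    for name, accs in [("balanced", balanced), ("unbalanced", unbalanced)]:
        print(name, arithmetic_mean(accs), harmonic_mean(accs))
```

Both placeholder methods have the same arithmetic mean (60), but the harmonic mean of the unbalanced one drops to 45, which is why it is a reasonable single-number summary of balanced performance.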