A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection

Matthew Korban, Peter Youngs, Scott T. Acton

TL;DR

This work tackles spatiotemporal action detection in untrimmed videos by introducing a semantic-and-motion-aware transformer (SMAST) that explicitly models interactions between persons and objects across space and motion. A motion-aware 2D positional encoding, a multi-feature selective semantic attention mechanism, and a sequence-based temporal attention are integrated to capture dynamic spatiotemporal variations and heterogeneous frame dependencies. The approach achieves state-of-the-art results on AVA v2.2, AVA v2.1, UCF101-24, and EPIC-Kitchens, demonstrating improved modeling of action semantics and temporal structure. The method emphasizes selective, multi-feature interactions and motion memory to enhance efficiency and accuracy in complex, real-world videos.

Abstract

This paper presents a novel spatiotemporal transformer network that introduces several original components to detect actions in untrimmed videos. First, the multi-feature selective semantic attention model calculates the correlations between spatial and motion features to model spatiotemporal interactions between different action semantics properly. Second, the motion-aware network encodes the locations of action semantics in video frames utilizing the motion-aware 2D positional encoding algorithm. Such a motion-aware mechanism memorizes the dynamic spatiotemporal variations in action frames that current methods cannot exploit. Third, the sequence-based temporal attention model captures the heterogeneous temporal dependencies in action frames. In contrast to standard temporal attention used in natural language processing, primarily aimed at finding similarities between linguistic words, the proposed sequence-based temporal attention is designed to determine both the differences and similarities between video frames that jointly define the meaning of actions. The proposed approach outperforms the state-of-the-art solutions on four spatiotemporal action datasets: AVA 2.2, AVA 2.1, UCF101-24, and EPIC-Kitchens.
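
To make the contrast in the abstract concrete, the sketch below computes standard scaled dot-product attention weights across frames. It is a simplified illustration of the baseline only, not the proposed sequence-based temporal attention, whose formulation is not given here. The per-frame feature vectors are hypothetical, and the learned query/key projections of a real transformer are omitted, so each frame's raw features serve as both query and key.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_weights(frames):
    """Row i holds the attention weights of frame i over all frames.

    Assumption: no learned projections; queries and keys are the raw features.
    """
    d = len(frames[0])
    weights = []
    for q in frames:
        # Scaled dot-product similarity between frame q and every frame k.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights.append(softmax(scores))
    return weights


# Hypothetical features: frames 0 and 1 are near-duplicates,
# frame 2 is a visually distinctive keyframe.
frames = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
weights = temporal_attention_weights(frames)
```

In this toy setting, frame 0 attends more strongly to its near-duplicate (frame 1) than to the distinctive frame 2, which is the similarity bias that the abstract attributes to standard temporal attention.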

Paper Structure (40 sections, 28 equations, 9 figures, 14 tables)

Figures (9)

  • Figure 1: The pipeline for the proposed method includes the preprocessing stage and the transformer network. Given the sequence of RGB images, first, the spatial semantics and optical flow fields are extracted in the preprocessing stage. The motion enhancement and segmentation algorithm extracts the motion semantics that are invariant to camera movement. The multi-feature selective attention model captures the correlative patterns between spatial and motion semantics. The motion memory module updates the semantic positional encoding (S-positional encoding) of the transformer network and makes it semantically motion-aware. The multi-feature fusion combines the extracted features and directs them to the deep network. The sequence-based temporal attention model captures the heterogeneous temporal dependencies between different times that are then used to detect the action sequence in the classification and regression stage.
  • Figure 2: The motion-aware positional encoding (c) compared to the standard one (a, b) in handling spatiotemporal variations in action semantics. Two basketball players, red and green, change positions between time $t$ (a) and time $t + \tau$ (b): the red player, the offensive player at $p_{11}$, gives up that position to the green player, who is now the defender at $p_{11}$. The proposed motion-aware positional encoding (c) can memorize the position changes of the two players using the motion memory offsets $\Delta p_{green}$ and $\Delta p_{red}$. The green and red arrows show the motion vectors obtained from the optical flow fields.
  • Figure 3: Examples of our multi-feature attention types that capture the spatiotemporal semantic interactions in action samples. (a): spatial-to-spatial attention between the "sitting person" and the "stationary cake"; (b): motion-to-motion attention between the "moving person" and the "moving jet ski"; (c): spatial-to-motion attention between the "stationary waiting player" and the "moving pitching player"; (d): motion-to-spatial attention between the "moving jumping player" and the "stationary net". The green arrows show the motion vectors obtained from the optical flow fields. The action samples are collected from the UCF101 dataset (Soomro et al., 2012).
  • Figure 4: The proposed transformer network includes several modules to capture multi-feature semantic representations. The feature embedding converts the motion and spatial semantics to features. The motion memory module memorizes the semantic position changes and includes them as the motion-aware positional encoding in the semantic multi-feature extraction. The multi-feature selective attention represents the correlations between each person and both other persons and the most relevant objects. In this action example, "teacher using an instructional tool", these correlations represent the interactions between the "teacher" and the "students", and between the "teacher" and relevant objects such as the "handheld whiteboard".
  • Figure 5: An example of the action "triple jump" comparing the proposed sequence-based temporal attention correlation, $\hat{A}^{corr}$, with the standard temporal attention correlation, $A^{corr}$. Here, $A^{corr}_{i-j}$ and $\hat{A}^{corr}_{i-j}$ represent the temporal correlation between frames $i$ and $j$. The standard temporal attention tends to give higher values to similar frames, such as $t=1$ and $t=2$, and lower values to distinctive non-adjacent frames, such as $t=3$ and $t=5$ or $t=6$ and $t=9$. In contrast, the proposed temporal attention does not discriminate against frames based on their similarities. Hence, $\hat{A}^{corr}$ is more effective than $A^{corr}$ in representing the temporal dependencies between distinctive, non-adjacent frames that act as keyframes.
  • ...and 4 more figures
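
The position swap described in Figure 2 can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the paper's algorithm: the grid coordinates are hypothetical, the base encoding is the common sinusoidal scheme applied per axis, and the motion offset $\Delta p$ is simply encoded the same way and appended. The only point it aims to show is that a purely positional encoding assigns the same code to whoever occupies a cell, whereas an encoding that also carries a motion offset can tell the two players apart.

```python
import math


def sinusoidal_1d(value, dim):
    """Standard sinusoidal encoding of one scalar into `dim` values (dim must be even)."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(value * freq))
        enc.append(math.cos(value * freq))
    return enc


def positional_encoding_2d(x, y, dim):
    """Common 2D scheme: half of the channels encode x, the other half encode y."""
    return sinusoidal_1d(x, dim // 2) + sinusoidal_1d(y, dim // 2)


def motion_aware_encoding(x, y, dx, dy, dim):
    """Hypothetical variant: position code followed by a code of the offset (dx, dy)."""
    return positional_encoding_2d(x, y, dim) + positional_encoding_2d(dx, dy, dim)


DIM = 8  # hypothetical channel count per encoding (dim // 2 must stay even)

# Time t: the red player stands at grid cell (1, 1) and has not moved yet.
red_plain = positional_encoding_2d(1, 1, DIM)
red_aware = motion_aware_encoding(1, 1, 0, 0, DIM)

# Time t + tau: the green player has moved from (2, 1) into (1, 1),
# so the assumed motion memory offset is (-1, 0).
green_plain = positional_encoding_2d(1, 1, DIM)
green_aware = motion_aware_encoding(1, 1, -1, 0, DIM)

same_without_motion = red_plain == green_plain
same_with_motion = red_aware == green_aware
```

Here `same_without_motion` is true because both players map to the same cell, while `same_with_motion` is false because the appended offset code differs. Concatenation is only one possible way to combine position and offset; the source does not say how the motion memory offsets enter the encoding.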