SFMViT: SlowFast Meet ViT in Chaotic World

Jiaying Lin, Jiajun Wen, Mengyuan Liu, Jinfu Liu, Baiqiao Yin, Yue Li

TL;DR

This work tackles spatiotemporal action localization in chaotic scenes by improving feature extraction and anchor efficiency. It introduces SFMViT, a dual-stream backbone that fuses ViT's global feature extraction with SlowFast's spatiotemporal sequence modeling, augmented by a Confidence Pruning Strategy that filters anchors using a max-heap and a capacity search. On Chaotic World, SFMViT reaches $26.62\%$ mAP, outperforming prior methods by a substantial margin and setting the state of the art on the 2024 MMVRAC leaderboard. The approach demonstrates the value of cross-stream backbone fusion and dynamic anchor selection for complex real-world videos.

Abstract

Spatiotemporal action localization in chaotic scenes is a challenging task on the path toward advanced video understanding. High-quality video feature extraction and more precise detector-predicted anchors can both effectively improve model performance. To this end, we propose SFMViT, a high-performance dual-stream spatiotemporal feature extraction network with an anchor pruning strategy. First, the backbone of SFMViT is composed of ViT and SlowFast with prior knowledge of spatiotemporal action localization, which fully utilizes ViT's excellent global feature extraction capabilities and SlowFast's spatiotemporal sequence modeling capabilities. Second, we introduce a confidence max-heap to prune the anchors detected in each frame, keeping only the effective anchors. These designs enable SFMViT to achieve an mAP of 26.62% on the Chaotic World dataset, far exceeding existing models. Code is available at https://github.com/jfightyr/SlowFast-Meet-ViT.
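The anchor pruning idea described in the abstract can be illustrated with a minimal sketch. It assumes that each anchor is a `(confidence, box)` pair and that a fixed `capacity` is given; the function name `prune_anchors`, the anchor layout, and the tie-breaking rule are hypothetical choices made for illustration and are not taken from the paper or its repository. The capacity search mentioned in the TL;DR is not shown here.

```python
import heapq
from typing import List, Tuple

# Hypothetical anchor layout: (confidence, (x1, y1, x2, y2)).
Box = Tuple[float, float, float, float]
Anchor = Tuple[float, Box]


def prune_anchors(anchors: List[Anchor], capacity: int) -> List[Anchor]:
    """Keep at most `capacity` anchors, highest confidence first."""
    if capacity <= 0:
        return []
    # heapq implements a min-heap, so confidences are negated to emulate
    # a max-heap. The index breaks ties in favor of earlier anchors.
    heap = [(-conf, idx) for idx, (conf, _) in enumerate(anchors)]
    heapq.heapify(heap)

    kept: List[Anchor] = []
    while heap and len(kept) < capacity:
        _, idx = heapq.heappop(heap)  # pops the most confident remaining anchor
        kept.append(anchors[idx])
    return kept


if __name__ == "__main__":
    # Toy detections for one keyframe (placeholder values, not real results).
    frame_anchors = [
        (0.20, (0.0, 0.0, 10.0, 10.0)),
        (0.90, (5.0, 5.0, 20.0, 20.0)),
        (0.50, (1.0, 1.0, 8.0, 8.0)),
        (0.70, (2.0, 2.0, 9.0, 9.0)),
    ]
    for conf, box in prune_anchors(frame_anchors, capacity=2):
        print(conf, box)
```

Building the heap once and popping `capacity` times costs O(n + k log n) per frame, which is one plausible reason to prefer a heap over fully sorting all detections when only a few anchors are retained.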

Paper Structure

This paper contains 15 sections, 3 equations, 3 figures, 2 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Framework of our proposed SFMViT. The left box (yellow background) shows the overall pipeline of the spatiotemporal action localization task, and the right box (orange background) shows the details of the proposed SFMViT module. The 32-frame input video clip passes through the ViT and SlowFast dual-stream networks to obtain spatiotemporal context features. The coordinate axes illustrate how the spatial size and channel dimensions of the feature maps change in the SlowFast branch. Anchors are obtained from YOLO-series object detectors applied at keyframes and are used to extract individual features from the context features. These individual features then go through high-order relation reasoning by the ACAR module (Pan et al., 2021) and our proposed Confidence Pruning Strategy before the final action category classification.
  • Figure 2: Difference in per-action AP between our model and the models that use SlowFast or ViT alone as the backbone. The five action categories with the largest absolute change in AP are labeled with their specific change values.
  • Figure 3: mAP as a function of capacity for SFMViT and ViT*. The shaded envelope is centered on the curve, with the curve value ± one standard deviation at each point as its upper and lower limits.
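
The envelope described in the Figure 3 caption (curve value ± standard deviation at each capacity) can be computed as in the following sketch. It assumes that several mAP measurements are available per capacity value and that the population standard deviation is used; both are assumptions, since the caption does not say how the deviation was estimated. The function name `envelope` and the input layout are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def envelope(
    runs_by_capacity: Dict[int, List[float]],
) -> Dict[int, Tuple[float, float, float]]:
    """Map each capacity to (lower, center, upper) = (mean - std, mean, mean + std)."""
    bands: Dict[int, Tuple[float, float, float]] = {}
    for capacity in sorted(runs_by_capacity):
        values = runs_by_capacity[capacity]
        center = mean(values)
        # Assumption: population standard deviation; swap in statistics.stdev
        # for the sample estimate if that matches the experimental setup.
        spread = pstdev(values)
        bands[capacity] = (center - spread, center, center + spread)
    return bands


if __name__ == "__main__":
    # Placeholder measurements, not results from the paper.
    toy_runs = {1: [1.0, 3.0], 2: [2.0, 2.0, 2.0]}
    for capacity, (low, mid, high) in envelope(toy_runs).items():
        print(capacity, low, mid, high)
```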