SFMViT: SlowFast Meet ViT in Chaotic World
Jiaying Lin, Jiajun Wen, Mengyuan Liu, Jinfu Liu, Baiqiao Yin, Yue Li
TL;DR
This work tackles spatiotemporal action localization in chaotic scenes by improving feature extraction and anchor selection. It introduces SFMViT, a dual-stream backbone that fuses ViT's global feature extraction with SlowFast's spatiotemporal sequence modeling, augmented by a Confidence Pruning Strategy that prunes anchors using a max-heap and capacity search. On Chaotic World, SFMViT reaches $26.62\%$ mAP, outperforming prior methods by a substantial margin and setting the state of the art on the 2024 MMVRAC leaderboard. The approach demonstrates the value of cross-stream backbone fusion and dynamic anchor selection for complex real-world videos.
Abstract
Spatiotemporal action localization in chaotic scenes is a challenging step toward advanced video understanding. Extracting high-quality video features and improving the precision of detector-predicted anchors can both effectively improve model performance. To this end, we propose SFMViT, a high-performance dual-stream spatiotemporal feature extraction network with an anchor pruning strategy. First, the backbone of SFMViT is composed of ViT and SlowFast with prior knowledge of spatiotemporal action localization, which fully utilizes ViT's excellent global feature extraction capabilities and SlowFast's spatiotemporal sequence modeling capabilities. Second, we introduce a confidence max-heap to prune the anchors detected in each frame, retaining only the effective ones. These designs enable SFMViT to achieve an mAP of 26.62% on the Chaotic World dataset, far exceeding existing models. Code is available at https://github.com/jfightyr/SlowFast-Meet-ViT.
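
The abstract does not spell out how the confidence max-heap is used, so the following is only a minimal sketch of the general idea: per frame, anchors are pushed into a max-heap keyed by confidence and the highest-scoring ones are popped until a capacity is reached. The function name `prune_anchors`, the `capacity` parameter, and the anchor layout `(confidence, box)` are assumptions made for illustration and are not taken from the released code.

```python
import heapq
from typing import List, Tuple

# Hypothetical anchor layout: (confidence, (x1, y1, x2, y2)).
Box = Tuple[float, float, float, float]
Anchor = Tuple[float, Box]


def prune_anchors(anchors: List[Anchor], capacity: int) -> List[Anchor]:
    """Keep at most `capacity` anchors of one frame, highest confidence first.

    Illustrative only: `capacity` is an assumed fixed budget, whereas the
    paper's strategy may choose it differently.
    """
    if capacity <= 0:
        return []
    # heapq implements a min-heap, so confidences are negated to get
    # max-heap behaviour; the index breaks ties deterministically.
    heap = [(-conf, idx) for idx, (conf, _box) in enumerate(anchors)]
    heapq.heapify(heap)

    kept: List[Anchor] = []
    while heap and len(kept) < capacity:
        _neg_conf, idx = heapq.heappop(heap)
        kept.append(anchors[idx])
    return kept


# Toy anchors for a single frame (made-up values).
frame_anchors: List[Anchor] = [
    (0.91, (10.0, 20.0, 50.0, 80.0)),
    (0.15, (12.0, 22.0, 48.0, 79.0)),
    (0.67, (100.0, 40.0, 140.0, 120.0)),
    (0.42, (200.0, 60.0, 230.0, 130.0)),
]
top_anchors = prune_anchors(frame_anchors, capacity=2)
```

With the toy input above, `top_anchors` holds the two anchors with confidences 0.91 and 0.67. A heap is convenient here because the highest-confidence anchors can be extracted without fully sorting every candidate in a crowded frame.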
