STMixer: A One-Stage Sparse Action Detector
Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, Limin Wang
TL;DR
STMixer addresses the inefficiency and limited context of traditional two-stage action detectors with a one-stage sparse action detector that samples discriminative features from a 4D ($x$-$y$-$t$-$z$) feature space under the guidance of learnable queries. Its two core designs, a query-guided adaptive feature sampling module and a spatio-temporal decoupled feature mixing module, enable flexible feature extraction and robust decoding. They are instantiated in two pipelines: STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Across five challenging benchmarks, STMixer achieves state-of-the-art results with favorable efficiency, which it owes largely to end-to-end training and the absence of a separate actor detector, and it performs strongly in both actor localization and temporal boundary prediction. The result is accurate, context-aware action detection that can be trained and deployed end to end.
Abstract
Traditional video action detectors typically adopt the two-stage pipeline, where a person detector is first employed to generate actor boxes and then 3D RoIAlign is used to extract actor-specific features for classification. This detection paradigm requires multi-stage training and inference, and the feature sampling is constrained inside the box, failing to effectively leverage richer context information outside. Recently, a few query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, thus suffering from the issues of inferior performance or slower convergence. In this paper, we propose two core designs for a more flexible one-stage sparse action detector. First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain. Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding. Based on these designs, we instantiate two detection pipelines, that is, STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Without bells and whistles, our STMixer detectors obtain state-of-the-art results on five challenging spatio-temporal action detection benchmarks for keyframe action detection or action tube detection.
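To make the two core designs concrete, the following is a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes a single-scale feature volume of shape (T, H, W, C), so the scale axis $z$ is omitted, and every name in it (`bilinear_sample`, `sample_features`, `decoupled_mix`, `w_offset`, `w_spatial`, `w_temporal`) is hypothetical. The weight matrices stand in for learned layers and are random here. The pooling choices and the query-conditioned channel mixing are simplifying assumptions made for illustration.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample feat (H, W, C) at continuous pixel coordinates (x, y)."""
    H, W, _ = feat.shape
    x = float(np.clip(x, 0.0, W - 1.0))
    y = float(np.clip(y, 0.0, H - 1.0))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom

def sample_features(video_feat, query, box, w_offset, num_points):
    """Query-guided sampling: the query predicts offsets around its box center.

    video_feat: (T, H, W, C); query: (D,); box: (cx, cy, w, h) in pixels.
    w_offset: (D, num_points * 2), a hypothetical linear layer (assumption).
    Returns (T, num_points, C); the same points are sampled in every frame.
    """
    cx, cy, w, h = box
    offsets = np.tanh(query @ w_offset).reshape(num_points, 2)  # in [-1, 1]
    # Offsets are scaled by the full box size, so points can fall outside
    # the box and pick up surrounding context.
    xs = cx + offsets[:, 0] * w
    ys = cy + offsets[:, 1] * h
    return np.stack([
        np.stack([bilinear_sample(frame, xs[p], ys[p]) for p in range(num_points)])
        for frame in video_feat
    ])

def decoupled_mix(sampled, query, w_spatial, w_temporal):
    """Mix sampled features along the spatial and temporal axes separately.

    sampled: (T, P, C). The spatial branch pools over time -> (P, C);
    the temporal branch pools over sampled points -> (T, C).
    w_spatial, w_temporal: (D, C * C), hypothetical layers that turn the
    query into channel-mixing matrices (assumption).
    Returns two (C,) vectors, one per branch.
    """
    C = sampled.shape[-1]
    spatial = sampled.mean(axis=0)    # (P, C)
    temporal = sampled.mean(axis=1)   # (T, C)
    mix_s = (query @ w_spatial).reshape(C, C)
    mix_t = (query @ w_temporal).reshape(C, C)
    spatial_out = np.maximum(spatial @ mix_s, 0.0).mean(axis=0)
    temporal_out = np.maximum(temporal @ mix_t, 0.0).mean(axis=0)
    return spatial_out, temporal_out

# Toy usage with random tensors standing in for backbone features and weights.
rng = np.random.default_rng(0)
T, H, W, C, D, P = 4, 8, 8, 6, 6, 5
video = rng.normal(size=(T, H, W, C))
query = rng.normal(size=D)
w_off = rng.normal(size=(D, P * 2))
sampled = sample_features(video, query, (4.0, 4.0, 3.0, 3.0), w_off, P)
s_out, t_out = decoupled_mix(
    sampled, query, rng.normal(size=(D, C * C)), rng.normal(size=(D, C * C))
)
```

In contrast to RoIAlign, which reads a fixed grid inside the actor box, the sampling locations here depend on the query and may lie outside the box. The two branches of `decoupled_mix` illustrate the idea of treating spatial and temporal evidence separately before they would be fused back into the query.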
