STMixer: A One-Stage Sparse Action Detector

Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, Limin Wang

TL;DR

STMixer tackles the inefficiency and limited context of traditional two-stage action detectors with a one-stage sparse action detector that, guided by learnable queries, samples discriminative features from a 4D ($x$-$y$-$t$-$z$) feature space. Its two core designs, a query-guided adaptive feature sampling module and a spatio-temporal decoupled feature mixing module, enable flexible feature extraction and robust decoding. They are realized in two pipelines: STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Across five challenging benchmarks, STMixer achieves state-of-the-art results with favorable efficiency, largely because it trains end to end and needs no separate actor detector. It also performs strongly in both actor localization and temporal boundary prediction. The approach advances practical video understanding by delivering accurate, context-aware action detection that can be deployed end to end.

Abstract

Traditional video action detectors typically adopt the two-stage pipeline, where a person detector is first employed to generate actor boxes and then 3D RoIAlign is used to extract actor-specific features for classification. This detection paradigm requires multi-stage training and inference, and the feature sampling is constrained inside the box, failing to effectively leverage richer context information outside. Recently, a few query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, thus suffering from the issues of inferior performance or slower convergence. In this paper, we propose two core designs for a more flexible one-stage sparse action detector. First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain. Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding. Based on these designs, we instantiate two detection pipelines, that is, STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Without bells and whistles, our STMixer detectors obtain state-of-the-art results on five challenging spatio-temporal action detection benchmarks for keyframe action detection or action tube detection.
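
The decoupled feature mixing described above can be illustrated with a minimal NumPy sketch. It follows the outline in the Figure 3 caption (pooling, query-generated mixing parameters, channel mixing then point mixing, a final linear layer that updates the query). It is not the authors' implementation: the function names, the weight shapes, the placement of layer normalization and ReLU, and the residual query update are all assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last axis (no learnable affine; an assumption).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaptive_mix(feats, query, w_channel, w_point, w_out):
    """Mix pooled features [N, D] under the guidance of one query [D].

    w_channel: [D*D, D], w_point: [N*N_out, D], w_out: [D, D*N_out].
    These stand in for the linear layers; the shapes are hypothetical.
    """
    n, d = feats.shape
    m_channel = (w_channel @ query).reshape(d, d)   # query-specific channel mixer
    m_point = (w_point @ query).reshape(n, -1)      # query-specific point mixer
    x = np.maximum(layer_norm(feats @ m_channel), 0.0)  # channel mixing -> [N, D]
    x = np.maximum(layer_norm(x.T @ m_point), 0.0)      # point mixing   -> [D, N_out]
    return query + w_out @ x.reshape(-1)            # assumed residual query update

def decoupled_mixing(sampled, spatial_q, temporal_q, spatial_params, temporal_params):
    """sampled: [T, P, D] features gathered at P points on each of T frames."""
    spatial_in = sampled.mean(axis=0)    # temporal pooling -> [P, D]
    temporal_in = sampled.mean(axis=1)   # spatial pooling  -> [T, D]
    new_spatial_q = adaptive_mix(spatial_in, spatial_q, *spatial_params)
    new_temporal_q = adaptive_mix(temporal_in, temporal_q, *temporal_params)
    return new_spatial_q, new_temporal_q
```

Pooling away one axis before each branch is what keeps the two branches decoupled: the spatial query only sees a [P, D] summary and the temporal query only a [T, D] summary, so the generated mixing matrices stay small.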

Paper Structure (21 sections, 16 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Comparison of mAP versus GFLOPs between different keyframe action detectors on AVA v2.2. The GFLOPs of CSN, SlowFast, and VideoMAE are the sum of Faster RCNN-R101-FPN detector GFLOPs and classifier GFLOPs. Different methods are marked by different markers, and models with the same backbone are marked in the same color. The results of CSN are from TubeR. Our STMixer-K achieves the best balance between effectiveness and efficiency.
  • Figure 2: Comparisons between keyframe action detectors and action tubelet detectors in generating video-level action tubes. We show two failure cases of the keyframe action detectors. On the left, the person moves so fast that the two boxes in frames 2 and 3 have a small IoU, which causes the action instance to be mistakenly split into 2 tubes. On the right, a missed detection in frame 3 causes a linking failure. However, linking tubelets produces correct action tubes in both cases.
  • Figure 3: Pipeline of STMixer-K for keyframe action detection. On the left is the overall architecture of STMixer-K. A 4D feature space is constructed on the feature maps of the input video clip. The action decoder contains $M$ stacked ASAM modules. In each module, adaptive feature sampling is performed first. Specifically, a group of offsets is generated on each spatial query by a linear layer and added to the center point of the corresponding positional query, thus yielding sampling points on the keyframe. The sampling points are then temporally propagated and used as the index for feature sampling from the 4D feature space. After feature sampling, adaptive feature mixing is performed. For spatial mixing, temporal pooling is applied to the sampled features. The mixing parameters are generated on each spatial query by a linear layer. Channel and point mixing are performed sequentially. Finally, the mixed feature is transformed by a linear layer and used to update the spatial query. Temporal mixing is performed symmetrically. Optionally, a short-term or long-term classifier can be used for action score prediction, whose detailed structures are illustrated on the right. The long-term classifier refers to the query bank produced by an offline STMixer-K for long-term information.
  • Figure 4: 4D feature space construction for a hierarchical video backbone. We construct the 4D feature space on multi-scale 3D feature maps from a hierarchical video backbone by simple lateral convolution and nearest-neighbor interpolation. The four dimensions of the 4D feature space are the x-, y-, and t-axes and the scale index z.
  • Figure 5: Pipeline of STMixer-T for action tubelet detection. A video clip composed of $T$ consecutive frames is input to the video backbone for feature extraction. A 4D feature space is constructed on the feature maps. We perform adaptive feature sampling and spatial mixing under the guidance of each spatial query in its corresponding frame feature space. A parallel temporal mixing branch is adopted for temporal modeling.
  • ...and 4 more figures
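
The captions of Figures 3 and 4 describe building a 4D feature space and sampling from it under query guidance. The NumPy sketch below illustrates those two steps under stated assumptions: all scales share the same temporal length, the first feature map sets the target resolution, the lateral convolution is a 1x1x1 projection written as a matrix product, and sampling uses nearest-neighbor lookup rather than interpolation. The function names and the `w_offset` layout are hypothetical and not taken from the paper.

```python
import numpy as np

def build_4d_feature_space(feature_maps, lateral_weights):
    """Stack multi-scale 3D feature maps into one [Z, D, T, H, W] array.

    feature_maps: list of arrays shaped [C_i, T, H_i, W_i] (same T assumed).
    lateral_weights: list of [D, C_i] matrices acting as 1x1x1 lateral convs.
    """
    _, _, out_h, out_w = feature_maps[0].shape
    levels = []
    for fmap, weight in zip(feature_maps, lateral_weights):
        lateral = np.einsum("dc,cthw->dthw", weight, fmap)  # project to D channels
        in_h, in_w = lateral.shape[2:]
        rows = np.arange(out_h) * in_h // out_h   # nearest-neighbor source rows
        cols = np.arange(out_w) * in_w // out_w   # nearest-neighbor source columns
        levels.append(lateral[:, :, rows][:, :, :, cols])
    return np.stack(levels)                       # scale index z is axis 0

def sample_from_4d_space(space, query, center_xy, w_offset, num_points):
    """Return features of shape [T, P, D] sampled around a box center.

    w_offset: [P*3, D] stand-in for the linear layer that maps a spatial
    query to P offsets (dx, dy, dz); this layout is an assumption.
    """
    num_scales, _, _, height, width = space.shape
    offsets = (w_offset @ query).reshape(num_points, 3)
    xs = np.clip(np.rint(center_xy[0] + offsets[:, 0]), 0, width - 1).astype(int)
    ys = np.clip(np.rint(center_xy[1] + offsets[:, 1]), 0, height - 1).astype(int)
    zs = np.clip(np.rint(offsets[:, 2]), 0, num_scales - 1).astype(int)
    # Temporal propagation: the same (x, y, z) points are read on every frame.
    sampled = space[zs, :, :, ys, xs]             # advanced indexing -> [P, D, T]
    return sampled.transpose(2, 0, 1)             # -> [T, P, D]
```

Because the sampling points are offsets from the query's box center rather than a grid inside the box, they are free to land on context outside the actor box, which is the flexibility the captions emphasize.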