Query matching for spatio-temporal action detection with query-based object detector
Shimon Hori, Kazuki Omi, Toru Tamaki
TL;DR
This work addresses spatio-temporal action detection (STAD) by extending the DETR framework with temporal feature shift and a query-matching mechanism that maintains temporal consistency across frames. DETR is applied frame by frame, and object queries are aligned across frames with a Hungarian-matching-based method before shifting, so that object-specific information is propagated over time. Experiments on JHMDB21 show that shifting decoder features and queries, when combined with query matching, improves video-mAP by up to around 6.4% while keeping frame-mAP reasonable, indicating better tube-level detection. The approach preserves DETR’s architecture with minimal modifications and suggests potential for transfer to larger STAD datasets such as AVA. Future work includes understanding the mechanism behind the gains and benchmarking against prior methods.
Abstract
In this paper, we propose a method that extends the query-based object detection model DETR to spatio-temporal action detection, which requires maintaining temporal consistency in videos. Our proposed method applies DETR to each frame and uses feature shift to incorporate temporal information. However, the same object query may correspond to different objects in different frames, which makes a simple feature shift ineffective. To overcome this issue, we propose query matching across frames, which ensures that queries for the same object are matched and used for the feature shift. Experimental results show that performance on the JHMDB21 dataset improves significantly when query features are shifted using the proposed query matching.
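The idea of matching queries before shifting can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity cost, the one-directional shift from the previous frame, and the names `match_queries` and `shift_from_previous` are assumptions made for illustration. A brute-force search over permutations stands in for the Hungarian algorithm so the example needs only the standard library; it returns the same optimal assignment but is only feasible for a handful of queries (in practice a solver such as `scipy.optimize.linear_sum_assignment` would be used).

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed matching cost)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_queries(prev, curr):
    """Return perm such that curr[perm[i]] is matched to prev[i].

    Brute-force optimal assignment; a stand-in for Hungarian matching.
    """
    n = len(prev)
    sim = [[cosine(p, c) for c in curr] for p in prev]
    return max(
        permutations(range(n)),
        key=lambda perm: sum(sim[i][perm[i]] for i in range(n)),
    )


def shift_from_previous(prev, curr, n_shift):
    """Overwrite the first n_shift channels of each current-frame query with
    those of its matched previous-frame query (hypothetical shift scheme)."""
    perm = match_queries(prev, curr)
    shifted = [list(c) for c in curr]
    for i, j in enumerate(perm):
        shifted[j][:n_shift] = prev[i][:n_shift]
    return perm, shifted


# Two queries per frame; the same two objects occupy swapped query slots.
prev = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.1]]
curr = [[0.1, 0.9, 0.3, 0.1], [0.9, 0.1, 0.4, 0.6]]

perm, shifted = shift_from_previous(prev, curr, n_shift=2)
print(perm)     # matched index in curr for each prev query
print(shifted)  # current-frame features after the matched shift
```

Without the matching step, shifting by query index would mix channels from different objects in this example, which is the failure mode the abstract describes.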
