Action tube generation by person query matching for spatio-temporal action detection

Kazuki Omi, Jion Oshima, Toru Tamaki

TL;DR

This work tackles spatio-temporal action detection by directly generating action tubes from raw video, eliminating IoU-based linking and clip splitting. It introduces a DETR-like framewise detector whose queries are linked across frames by a Query Matching Module trained with metric learning, enabling robust tracking of the same person despite large motions. An action head operates on sequences of linked queries to produce per-frame action scores, allowing variable-length inputs and end-to-end tube output. The approach achieves favorable efficiency with competitive accuracy on JHMDB, UCF101-24, and AVA, and shows particular strength for large-position-change actions, making it suitable for resource-constrained or real-time applications.

Abstract

This paper proposes a method for spatio-temporal action detection (STAD) that directly generates action tubes from the original video without relying on post-processing steps such as IoU-based linking and clip splitting. Our approach applies query-based detection (DETR) to each frame and matches DETR queries to link the same person across frames. We introduce the Query Matching Module (QMM), which uses metric learning to bring queries for the same person closer together across frames compared to queries for different people. Action classes are predicted using the sequence of queries obtained from QMM matching, allowing for variable-length inputs from videos longer than a single clip. Experimental results on JHMDB, UCF101-24, and AVA datasets demonstrate that our method performs well for large position changes of people while offering superior computational efficiency and lower resource requirements.

Paper Structure

This paper contains 27 sections, 10 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1:
  • Figure 2:
  • Figure 4: The proposed approach. Linking is performed by matching queries assigned to the same person, eliminating the need for the IoU-based linking.
  • Figure 5: Overview of the proposed method. First, the frame features are obtained using the frame backbone. Next, the frame features and queries interact in a transformer. Then, the proposed Query Matching Module (QMM) matches the queries responsible for the same person in different frames. Finally, the output of QMM is classified by the action head to predict actions and bounding boxes in each frame.
  • Figure 6: Overview of the QMM procedure. Initially, we collect the query set $Q^t$ at frame $t$. Subsequently, we filter out the queries in $Q^t$ that are not responsible for any person to form the query set $Q^{t \prime}$. Next, we apply $f_p$ to the queries in $Q^{t \prime}$ and to those in the list $L_j$ for the same person. Finally, we compute the similarity between the encoded features and, if it surpasses a threshold, we identify the query as belonging to the same person and add it to the corresponding list.
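
The QMM procedure described in the Figure 6 caption can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `link_queries`, the per-query person score used for filtering, the identity stand-in for $f_p$, cosine similarity as the similarity measure, the comparison against only the most recent query in each list $L_j$, the greedy one-to-one assignment, the creation of a new list for unmatched queries, and both threshold values are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def link_queries(frames, f_p=lambda q: q, score_thresh=0.5, sim_thresh=0.8):
    """Link per-frame queries into per-person lists L_j.

    frames: list over t of lists of (query_vector, person_score) pairs.
    f_p: hypothetical encoder; the identity here, a learned network in the paper.
    Returns a list of lists, each holding (t, query_vector) entries.
    """
    lists = []  # lists[j] plays the role of L_j
    for t, queries in enumerate(frames):
        # Form Q^{t'}: drop queries assumed not to be responsible for a person.
        kept = [q for q, score in queries if score >= score_thresh]
        used = set()  # lists already extended in this frame
        for q in kept:
            best_j, best_sim = None, sim_thresh
            for j, lst in enumerate(lists):
                if j in used:
                    continue
                # Assumption: compare with the latest query stored in L_j.
                sim = cosine(f_p(q), f_p(lst[-1][1]))
                if sim > best_sim:
                    best_j, best_sim = j, sim
            if best_j is None:
                # Assumption: an unmatched query starts a new person list.
                lists.append([(t, q)])
                used.add(len(lists) - 1)
            else:
                lists[best_j].append((t, q))
                used.add(best_j)
    return lists
```

Because matching depends on query similarity rather than box overlap, a person whose position changes sharply between frames can still be appended to the same list, which is the property the abstract highlights. Each resulting list is the variable-length query sequence that would then be passed to the action head.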