Micro-gesture Online Recognition using Learnable Query Points

Pengyu Liu; Fei Wang; Kun Li; Guoliang Chen; Yanyan Wei; Shengeng Tang; Zhiliang Wu; Dan Guo

TL;DR

This work tackles Micro-gesture Online Recognition by reframing it as a set-prediction problem with learnable query points and vectors. It extends the PointTAD baseline with a Mamba-MHSA block and a Multi-Level Interactive Module to better model temporal semantics and localize boundaries, and it is evaluated on the SMG dataset. The proposed method achieves $F1=14.34$ and ranks second in the Micro-gesture Online Recognition track of the MiGA challenge, indicating improved micro-gesture (MG) discrimination and boundary detection. Ablations guide design choices such as $N_q$, window size, decoder depth, and the number of Mamba blocks. Future work includes integrating skeletal data to further improve recognition performance and robustness.

Abstract

In this paper, we briefly introduce the solution developed by our team, HFUT-VUT, for the Micro-gesture Online Recognition track in the MiGA challenge at IJCAI 2024. The Micro-gesture Online Recognition task involves identifying the category and locating the start and end times of micro-gestures in video clips. Compared to the typical Temporal Action Detection task, the Micro-gesture Online Recognition task focuses more on distinguishing between micro-gestures and pinpointing the start and end times of actions. Our solution ranks 2nd in the Micro-gesture Online Recognition track.
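The task output described above (a category plus start and end times for each micro-gesture) can be scored with a detection-style F1. The following is a minimal sketch, not the official MiGA evaluation protocol: the names `temporal_iou` and `f1_score`, the greedy one-to-one matching, and the IoU threshold of 0.3 are all assumptions made for illustration.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec, label). This layout is an assumption.
Segment = Tuple[float, float, str]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals (labels ignored)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def f1_score(preds: List[Segment], truths: List[Segment],
             iou_threshold: float = 0.3) -> float:
    """Greedy matching: a prediction is a true positive if it has the same
    label as a not-yet-matched ground-truth segment and their temporal IoU
    reaches the (assumed) threshold."""
    matched = set()
    tp = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, t in enumerate(truths):
            if j in matched or t[2] != p[2]:
                continue
            iou = temporal_iou(p, t)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0 and best_iou >= iou_threshold:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp   # predictions with no matching ground truth
    fn = len(truths) - tp  # ground-truth segments that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data, not taken from the SMG dataset.
truths = [(0.0, 2.0, "a"), (5.0, 7.0, "b")]
preds = [(0.5, 2.0, "a"), (5.0, 7.0, "c"), (10.0, 11.0, "a")]
score = f1_score(preds, truths)
```

In this toy case only the first prediction matches: the second has the right boundaries but the wrong category, and the third has no overlapping ground truth. This illustrates why the task rewards both discriminating between micro-gestures and locating their boundaries.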

Paper Structure (15 sections, 16 equations, 1 figure, 2 tables)

Introduction
Related Work
Method
  Task Definition
  Overall Architecture
  Video Encoder
  Learnable Query Points
  Mamba-MHSA Block
  Multi-Level Interactive Module
Experiments
  Dataset and Evaluation Metric
  Implementation Details
  Experimental Results
  Ablation Study
Conclusion

Figures (1)

Figure 1: The proposed model consists of a video encoder, which extracts video features from continuous RGB frames, and an action decoder.
