OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition

Haochen Chang, Pengfei Ren, Buyuan Zhang, Da Li, Tianhao Han, Haoyang Zhang, Liang Xie, Hongbo Chen, Erwei Yin

TL;DR

This work targets skeleton-based online micro gesture recognition by introducing OMG-Bench, the first large-scale public dataset for fine-grained, continuous hand gestures, captured with high-quality multi-view data and semi-automatic annotation. It proposes HMATr, an end-to-end Hierarchical Memory-Augmented Transformer that unifies gesture detection and classification using frame- and window-level memory along with learnable position-aware queries, enabling robust performance with non-overlapping window streaming. HMATr achieves state-of-the-art results on OMG-Bench and also generalizes to online macro-gesture benchmarks SHREC’21/22, backed by extensive ablations and efficiency analyses. The dataset and method collectively advance online micro gesture recognition for VR/AR by providing a challenging benchmark, a scalable annotation pipeline, and a practical, efficient architecture for real-time interaction. The work also demonstrates practical deployment potential via a video demo on consumer-grade hardware, highlighting real-time viability and cross-algorithm robustness.

Abstract

Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition. Project page: https://omg-bench.github.io/
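
To make the streaming setup in the abstract more concrete, the following is a minimal sketch of non-overlapping window streaming with a two-level memory. It is not the authors' implementation: the names `non_overlapping_windows`, `HierarchicalMemory`, and `stream` are hypothetical, the bank sizes are arbitrary, and mean pooling is used only as a stand-in for whatever learned window-level representation HMATr actually stores.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

Frame = List[float]  # one flattened skeleton frame (hypothetical layout)


def non_overlapping_windows(
    frames: Iterable[Frame], window_size: int
) -> Iterator[List[Frame]]:
    """Yield consecutive, non-overlapping windows; a trailing partial window is dropped."""
    buffer: List[Frame] = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == window_size:
            yield buffer
            buffer = []


class HierarchicalMemory:
    """Two bounded FIFO banks: recent raw frames, and one summary vector per past window."""

    def __init__(self, max_frames: int, max_windows: int) -> None:
        self.frame_bank: Deque[Frame] = deque(maxlen=max_frames)
        self.window_bank: Deque[Frame] = deque(maxlen=max_windows)

    def update(self, window: List[Frame]) -> None:
        # Frame-level bank keeps fine-grained history; oldest frames fall out first.
        self.frame_bank.extend(window)
        # Assumption: mean pooling stands in for a learned window-level embedding.
        dim = len(window[0])
        summary = [sum(f[d] for f in window) / len(window) for d in range(dim)]
        self.window_bank.append(summary)


def stream(
    frames: Iterable[Frame], window_size: int, memory: HierarchicalMemory
) -> Iterator[Tuple[int, int]]:
    """For each window, report the history available to it, then update the memory."""
    for window in non_overlapping_windows(frames, window_size):
        context = (len(memory.frame_bank), len(memory.window_bank))
        memory.update(window)
        yield context
```

In this sketch each window sees only the history stored before it arrives, so every frame is processed once; a real model would attend over both banks rather than merely counting their entries.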

Paper Structure

This paper contains 29 sections, 7 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Data collection and annotation pipeline of OMG-Bench, using a calibrated five-camera RGB-D system and self-supervised multi-view hand pose estimation to obtain high-quality skeletons, followed by semi-automatic frame-level gesture labeling.
  • Figure 2: Dataset properties. (a) Types and locations of defined micro gestures. TIP, PIP, and MCP denote the fingertip, proximal interphalangeal joint, and metacarpophalangeal joint. (b) Statistics of gesture types. (c) Distribution of sample counts per class.
  • Figure 3: Overview of our proposed HMATr. (a) Lightweight backbone processes streaming skeleton inputs using a non-overlapping sliding window approach. (b) Hierarchical memory bank uses historical temporal information to enrich the content of the current window. (c) Position-aware queries implicitly capture potential hand movements, enabling unified detection and recognition. (d) Memory Interaction and Position-aware Interaction encode both position and semantic information of gesture instances from the memory-enhanced features.
  • Figure 4: Visualization of (a) query distribution and (b) online recognition results of the gesture sequence.
  • Figure 5: Types and interaction locations of all defined micro gestures. TIP, PIP, and MCP are anatomical terms for finger parts, referring to the fingertip, proximal interphalangeal joint, and metacarpophalangeal joint, respectively.
  • ...and 1 more figures