Duo Streamers: A Streaming Gesture Recognition Framework

Boxuan Zhu, Sicheng Yang, Zhuo Wang, Haining Liang, Junxiao Shen

TL;DR

Duo Streamers tackles the challenge of real-time gesture recognition on resource-constrained devices by introducing a streaming RNN-lite model with an external hidden state and a three-stage sparse recognition mechanism. The framework uses a lightweight Euclidean Analyzer, a binary Detector, and an active Recognizer to process data with minimal idle computation and rapid activation when gestures occur, complemented by a streaming training and post-processing pipeline. Experimental results on SHREC2021 show that Duo Streamers achieves accuracy comparable to baselines while significantly reducing parameters (down to 1/38 idle, 1/9 busy) and real-time factor (nearly 13x faster), plus an Early Detection Latency of 6.38 frames. The work demonstrates strong potential for edge deployment and provides a foundation for multimodal and diverse real-time interaction scenarios in wearable and mobile devices.

Abstract

Gesture recognition in resource-constrained scenarios struggles to achieve high accuracy and low latency at the same time. Duo Streamers, the streaming gesture recognition framework proposed in this paper, addresses these challenges through a three-stage sparse recognition mechanism, an RNN-lite model with an external hidden state, and specialized training and post-processing pipelines, improving both real-time performance and model compactness. Experimental results show that Duo Streamers matches mainstream methods in accuracy while reducing the real-time factor by approximately 92.3%, i.e., delivering a nearly 13-fold speedup. In addition, the framework shrinks parameter counts to 1/38 (idle state) and 1/9 (busy state) of those of mainstream models. In summary, Duo Streamers not only offers an efficient and practical solution for streaming gesture recognition on resource-constrained devices but also lays a solid foundation for extended applications in multimodal and diverse scenarios.
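
The 92.3% reduction and the 13-fold speedup quoted in the abstract are two views of the same number. As a short worked check, writing the baseline real-time factor as RTF_base and the Duo Streamers value as RTF_duo (symbol names chosen here for illustration, not taken from the paper):

```latex
\[
\mathrm{RTF}_{\text{duo}} \approx (1 - 0.923)\,\mathrm{RTF}_{\text{base}} = 0.077\,\mathrm{RTF}_{\text{base}}
\quad\Longrightarrow\quad
\text{speedup} = \frac{\mathrm{RTF}_{\text{base}}}{\mathrm{RTF}_{\text{duo}}} \approx \frac{1}{0.077} \approx 12.99
\]
```

A remaining real-time factor of about 7.7% of the baseline therefore corresponds to the "nearly 13-fold" speedup.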

Paper Structure

This paper contains 20 sections, 3 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Window-based Model vs. Streaming Model. The proposed streaming model reduces model size and can be activated on demand to deliver immediate inference over the data stream, thereby lowering latency and improving real-time performance.
  • Figure 2: Our proposed RNN-lite streaming model and its three-stage sparse recognition mechanism. The skeleton stream first passes through the Euclidean analyzer and the Detector’s gated external hidden state, continuously monitoring for the presence of valid gestures. Once a valid gesture is detected, the dormant gesture recognition model (Recognizer) is awakened, and during this active phase, gating is applied to the Recognizer’s external hidden state. After the valid gesture concludes, the Recognizer returns to its dormant state, significantly reducing computational overhead and energy consumption. By storing temporal information in a compressed external hidden state and relying solely on the compact Detector for binary classification during idle periods, the model substantially lowers its dependency on hardware resources. Ultimately, the parameters required for inference are reduced to just one-ninth to one-thirty-eighth of the baseline model’s, enabling unified deployment across multiple platforms while enhancing early recognition capabilities.
  • Figure 3: System logs produced as the framework processes data from the SHREC 2021 dataset and camera-based skeletal streams. The left figure presents the detection–recognition outputs and the ground truth, where differently colored dashed lines indicate the confidence of each gesture over time (in seconds), ranging from 0 to 1. The right figure illustrates how the model performs real-time streaming data processing on camera inputs. Because camera data are pixel-based, we first use the MediaPipe library to generate hand keypoint skeleton streams, which are then fed into our framework. The logs indicate that the model balances early recognition with accuracy across various gestures, demonstrating strong adaptability and effectiveness in real-time, high-frequency interactive scenarios. Relevant video content for visualization can be found in the supplementary materials.
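
The three-stage control flow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The `TinyRNNCell` class, the `run_stream` function, the thresholds, the layer sizes, the 21-joint input, and the choice to reset the Recognizer state when it goes dormant are all assumptions made here. In particular, the Euclidean analyzer is assumed to be a mean joint displacement between consecutive frames, which the summary does not confirm.

```python
import numpy as np


class TinyRNNCell:
    """Hypothetical one-layer RNN cell whose hidden state lives outside the model."""

    def __init__(self, in_dim, hid_dim, out_dim, rng):
        self.hid_dim = hid_dim
        self.W_x = rng.standard_normal((hid_dim, in_dim)) * 0.1
        self.W_h = rng.standard_normal((hid_dim, hid_dim)) * 0.1
        self.W_o = rng.standard_normal((out_dim, hid_dim)) * 0.1

    def step(self, x, h):
        # The caller passes the hidden state in and stores the returned one.
        h_new = np.tanh(self.W_x @ x + self.W_h @ h)
        return self.W_o @ h_new, h_new


def run_stream(frames, detector, recognizer, motion_thresh=0.01, detect_thresh=0.5):
    """Process frames of shape (T, J, 3) one at a time; return (stage, label) per frame."""
    h_det = np.zeros(detector.hid_dim)
    h_rec = np.zeros(recognizer.hid_dim)
    prev = frames[0]
    log = []
    for frame in frames:
        # Stage 1 (assumed form): mean Euclidean displacement of the joints.
        motion = np.linalg.norm(frame - prev, axis=-1).mean()
        prev = frame
        x = frame.ravel()
        if motion < motion_thresh:
            h_rec = np.zeros(recognizer.hid_dim)  # Recognizer stays dormant
            log.append(("idle", None))
            continue
        # Stage 2: compact binary Detector decides whether a gesture is present.
        logit, h_det = detector.step(x, h_det)
        p_gesture = 1.0 / (1.0 + np.exp(-logit[0]))
        if p_gesture < detect_thresh:
            h_rec = np.zeros(recognizer.hid_dim)  # Recognizer returns to dormancy
            log.append(("detect", None))
            continue
        # Stage 3: the larger Recognizer runs only while a gesture is active.
        scores, h_rec = recognizer.step(x, h_rec)
        log.append(("recognize", int(np.argmax(scores))))
    return log


rng = np.random.default_rng(0)
detector = TinyRNNCell(in_dim=63, hid_dim=8, out_dim=1, rng=rng)     # untrained, random weights
recognizer = TinyRNNCell(in_dim=63, hid_dim=16, out_dim=5, rng=rng)  # 5 illustrative classes
frames = rng.standard_normal((10, 21, 3))                            # synthetic skeleton stream
log = run_stream(frames, detector, recognizer)
print(log)
```

The point of the sketch is the gating: during idle periods only the cheap displacement check and the small Detector run, so the Recognizer's parameters are touched only while a gesture is judged to be in progress. With untrained random weights the predicted labels carry no meaning.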