Duo Streamers: A Streaming Gesture Recognition Framework
Boxuan Zhu, Sicheng Yang, Zhuo Wang, Haining Liang, Junxiao Shen
TL;DR
Duo Streamers targets real-time gesture recognition on resource-constrained devices by combining a streaming RNN-lite model, which keeps its hidden state external, with a three-stage sparse recognition mechanism. The framework uses a lightweight Euclidean Analyzer, a binary Detector, and an active Recognizer so that computation stays minimal while idle and activates rapidly when a gesture occurs; a streaming training and post-processing pipeline complements these stages. On SHREC2021, Duo Streamers achieves accuracy comparable to the baselines while reducing the parameter count to 1/38 (idle state) and 1/9 (busy state) of theirs, running nearly 13x faster as measured by real-time factor, and reaching an Early Detection Latency of 6.38 frames. The work shows strong potential for edge deployment and provides a foundation for multimodal and diverse real-time interaction scenarios on wearable and mobile devices.
Abstract
Gesture recognition in resource-constrained scenarios struggles to achieve both high accuracy and low latency. This paper proposes Duo Streamers, a streaming gesture recognition framework that addresses these challenges through a three-stage sparse recognition mechanism, an RNN-lite model with an external hidden state, and specialized training and post-processing pipelines, improving both real-time performance and model compactness. Experimental results show that Duo Streamers matches mainstream methods in accuracy while reducing the real-time factor by approximately 92.3%, which corresponds to a nearly 13-fold speedup (1 / (1 - 0.923) ≈ 13). In addition, the framework shrinks the parameter count to 1/38 (idle state) and 1/9 (busy state) of that of mainstream models. Duo Streamers thus offers an efficient and practical solution for streaming gesture recognition on resource-constrained devices and lays a foundation for extended applications in multimodal and diverse scenarios.
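To make the three-stage sparse mechanism concrete, the following is a minimal sketch of how such gating could look in a per-frame streaming loop. It is not the authors' implementation: the function names (`euclidean_motion`, `toy_detector`, `rnn_lite_step`, `process_stream`), the thresholds, the frame format, and the toy recurrent update are all assumptions made for illustration. The only ideas taken from the abstract are the ordering of a cheap Euclidean check, a binary detection step, and a recognizer that runs only when needed, plus a hidden state that lives outside the model.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # assumed input: flattened joint coordinates of one frame


def euclidean_motion(prev: Frame, cur: Frame) -> float:
    """Euclidean distance between two consecutive frames (stand-in Analyzer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(prev, cur)))


def toy_detector(motion: float) -> float:
    """Hypothetical binary Detector: a fixed logistic score on motion magnitude."""
    return 1.0 / (1.0 + math.exp(-4.0 * (motion - 0.5)))


def rnn_lite_step(x: Frame, h: List[float]) -> List[float]:
    """Toy recurrent update. The hidden state is passed in and returned,
    so the model itself holds no state between frames."""
    drive = sum(x) / len(x)
    return [math.tanh(0.5 * drive + 0.5 * hi) for hi in h]


def process_stream(
    frames: Sequence[Frame],
    motion_threshold: float = 0.1,   # assumed value
    detect_threshold: float = 0.5,   # assumed value
    hidden_size: int = 4,            # assumed value
) -> Tuple[List[str], List[float]]:
    """Run the three-stage gate over a stream and report which stage handled each frame."""
    hidden = [0.0] * hidden_size  # external hidden state, owned by the loop
    prev = None
    stages: List[str] = []
    for frame in frames:
        if prev is None:
            stages.append("idle")            # nothing to compare against yet
        else:
            motion = euclidean_motion(prev, frame)
            if motion < motion_threshold:
                stages.append("idle")        # stage 1 only: cheap distance check
            elif toy_detector(motion) < detect_threshold:
                stages.append("detector")    # stage 2: motion seen, no gesture detected
            else:
                hidden = rnn_lite_step(frame, hidden)  # stage 3: recognizer runs
                stages.append("recognizer")
        prev = frame
    return stages, hidden


stream = [[0.0, 0.0], [0.01, 0.0], [0.31, 0.0], [1.31, 0.0]]
stages, hidden = process_stream(stream)
print(stages)
```

In this toy stream the first two frames stay in the idle path, the third triggers only the detector, and the fourth activates the recognizer and updates the hidden state. Because the loop carries `hidden` rather than the model, the state can be stored, reset, or handed over between calls, which is one plausible reading of "external hidden state" in a streaming setting.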
