Rhythm in the Air: Vision-based Real-Time Music Generation through Gestures
Barathi Subramanian, Rathinaraja Jeyaraj, Anand Paul, Kapilya Gangadharan
TL;DR
This work tackles real-time, vision-based gesture-driven music generation by introducing a VDGR pipeline powered by a multi-layer attention-based GRU (MLA-GRU). It builds a custom dataset of $21$ gesture classes (7 notes × 3 pitches) and demonstrates that MLA-GRU achieves higher accuracy and faster inference than a classical GRU, including an accuracy of $96.83\%$ vs $86.7\%$ and a micro/macro ROC-AUC near $0.98$. The approach leverages MediaPipe Holistic landmarks to robustly extract features from gestures and applies a three-layer GRU with an attention mechanism to capture multi-scale temporal patterns, enabling real-time music generation through gesture-to-note mappings. The results indicate strong potential for touchless, expressive HCI in live performances and accessible music creation across devices, with the dataset released for community use.
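The gesture-to-note mapping over $21$ classes (7 notes × 3 pitches) can be illustrated with a minimal sketch. The summary does not state which note names, octaves, or class ordering the authors use, so the natural notes C through B, the octaves 3 to 5, the MIDI numbering, and the function name `class_to_note` are all assumptions made here for illustration.

```python
from typing import NamedTuple

# Assumptions (not from the paper): the 7 notes are the natural notes C..B,
# the 3 pitch levels are octaves 3, 4 and 5, and class ids are ordered
# pitch level first, then note.
NOTE_NAMES = ["C", "D", "E", "F", "G", "A", "B"]
SEMITONES = [0, 2, 4, 5, 7, 9, 11]  # semitone offset of each note above C
OCTAVES = [3, 4, 5]


class Note(NamedTuple):
    name: str
    octave: int
    midi: int


def class_to_note(class_id: int) -> Note:
    """Map a predicted class id in [0, 21) to a hypothetical note."""
    num_classes = len(NOTE_NAMES) * len(OCTAVES)
    if not 0 <= class_id < num_classes:
        raise ValueError(f"class_id must be in [0, {num_classes}), got {class_id}")
    level, note_idx = divmod(class_id, len(NOTE_NAMES))
    octave = OCTAVES[level]
    # Standard MIDI convention: C4 (middle C) is note number 60.
    midi = 12 * (octave + 1) + SEMITONES[note_idx]
    return Note(NOTE_NAMES[note_idx], octave, midi)
```

Under these assumptions, `class_to_note(7)` returns middle C (`Note(name='C', octave=4, midi=60)`). A real system would pass the MIDI number to a synthesizer as soon as the classifier emits a label.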
Abstract
Gesture recognition is an essential component of human-computer interaction (HCI), facilitating seamless interaction between users and computer systems without physical touch. This paper introduces an innovative application of vision-based dynamic gesture recognition (VDGR) for real-time music composition through gestures. To implement this application, we generate a custom gesture dataset that encompasses over 15,000 samples across 21 classes, covering 7 musical notes, each at three distinct pitch levels. To deal effectively with the modest volume of training data and to accurately discern and prioritize complex gesture sequences for music creation, we develop a multi-layer attention-based gated recurrent unit (MLA-GRU) model, in which a gated recurrent unit (GRU) learns temporal patterns from the observed sequence and an attention layer focuses on musically pertinent gesture segments. Our empirical studies demonstrate that MLA-GRU significantly surpasses the classical GRU model, achieving an accuracy of 96.83% compared to the baseline's 86.7%. Moreover, our approach exhibits superior efficiency and processing speed, which are crucial for interactive applications. We believe the proposed system will let people interact with music in a new and exciting way. This work not only advances HCI experiences but also highlights MLA-GRU's effectiveness in scenarios demanding swift and precise gesture recognition.
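The idea of an attention layer that weights some gesture segments more than others can be sketched as attention pooling over a sequence of recurrent hidden states. This is a generic dot-product attention written in plain Python, not the authors' MLA-GRU. The learned `query` vector, the function names, and the scoring function are assumptions, since the abstract does not specify the attention form.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden_states: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool per-frame hidden states into a single context vector.

    hidden_states: one vector per time step (e.g. outputs of a GRU).
    query: hypothetical learned vector that scores each time step.
    Returns (context_vector, attention_weights).
    """
    # Dot-product score for each time step (assumed scoring function).
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Context is the attention-weighted average of the hidden states.
    context = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return context, weights
```

Time steps whose hidden state aligns with `query` receive larger weights, so they dominate the context vector passed to the classifier. With an all-zero query the weights are uniform and the pooling reduces to a plain average over time.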
