GestLLM: Advanced Hand Gesture Interpretation via Large Language Models for Human-Robot Interaction
Oleg Kobzarev, Artem Lykov, Dzmitry Tsetserukou
TL;DR
GestLLM addresses the limitation of fixed gesture vocabularies in human-robot interaction by combining MediaPipe-based hand landmark features with large language models to interpret diverse and culturally nuanced gestures. The approach has three parts: image preprocessing (a MediaPipe-based hand representation), context enhancement (textual descriptions of finger positions and trajectories), and task creation with post-processing (LLM-driven gesture-to-command translation using a classifier/explainer and a contextual vector store). In evaluation, GestLLM recognizes underrepresented gestures zero-shot, with close-range accuracy comparable to GPT-4o and better robustness at longer ranges. A user study on a UR3 robot indicates overall workload comparable to a gamepad, with somewhat higher cognitive and physical demand but lower frustration. The results suggest GestLLM enables more natural, inclusive gesture-based control, with potential applications in advanced HRI, assistive robotics, and interactive entertainment. Future work will reduce cognitive and physical load and extend support to dynamic, multi-step gestures.
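To make the context-enhancement step more concrete, the following is a minimal sketch of how hand landmarks might be turned into a textual description that an LLM can read. It is an illustration under stated assumptions, not the authors' implementation: the landmark indices follow the standard 21-point MediaPipe Hands layout (wrist at index 0, fingertips at 4, 8, 12, 16, 20), while the distance-based "extended vs. curled" rule, the function names `finger_states` and `describe_hand`, and the output wording are all hypothetical choices made for this example.

```python
import math

# Standard MediaPipe Hands layout: 21 landmarks, wrist at index 0.
WRIST = 0

# For each finger: (middle joint index, fingertip index).
# Thumb uses its IP joint; the other fingers use their PIP joints.
FINGER_JOINTS = {
    "thumb": (3, 4),
    "index": (6, 8),
    "middle": (10, 12),
    "ring": (14, 16),
    "pinky": (18, 20),
}


def finger_states(landmarks):
    """Label each finger 'extended' or 'curled'.

    Assumed heuristic (not taken from the paper): a finger counts as
    extended when its tip is farther from the wrist than its middle joint.
    `landmarks` is a sequence of 21 (x, y, z) tuples.
    """
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    wrist = landmarks[WRIST]
    states = {}
    for name, (joint, tip) in FINGER_JOINTS.items():
        tip_dist = math.dist(landmarks[tip], wrist)
        joint_dist = math.dist(landmarks[joint], wrist)
        states[name] = "extended" if tip_dist > joint_dist else "curled"
    return states


def describe_hand(landmarks):
    """Render the finger states as one line of text for an LLM prompt."""
    states = finger_states(landmarks)
    return "; ".join(f"{name}: {state}" for name, state in states.items())
```

A description such as `thumb: curled; index: extended; middle: extended; ring: curled; pinky: curled` could then be placed in the LLM prompt in place of raw pixels. A real system would likely need richer features (finger spread, palm orientation, trajectories over time) to separate gestures like the Vulcan salute from an open palm.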
Abstract
This paper introduces GestLLM, an advanced system for human-robot interaction that enables intuitive robot control through hand gestures. Unlike conventional systems, which rely on a limited set of predefined gestures, GestLLM leverages large language models and feature extraction via MediaPipe to interpret a diverse range of gestures. This integration addresses key limitations in existing systems, such as restricted gesture flexibility and the inability to recognize complex or unconventional gestures commonly used in human communication. By combining state-of-the-art feature extraction and language model capabilities, GestLLM achieves performance comparable to leading vision-language models while supporting gestures underrepresented in traditional datasets. These include gestures from popular culture, such as the "Vulcan salute" from Star Trek, which are recognized without additional pretraining or prompt engineering. This flexibility enhances the naturalness and inclusivity of robot control, making interactions more intuitive and user-friendly. GestLLM represents a significant step forward in gesture-based interaction, enabling robots to understand and respond to a wide variety of hand gestures effectively. This paper outlines its design, implementation, and evaluation, demonstrating its potential applications in advanced human-robot collaboration, assistive robotics, and interactive entertainment.
