Multimodal "Puppeteer": Exploring Robot Teleoperation Via Virtual Counterpart with LLM-Driven Voice and Gesture Interaction in Augmented Reality
Yuchong Zhang, Bastian Orthmann, Shichen Ji, Michael Welle, Jonne Van Haastregt, Danica Kragic
TL;DR
This study introduces an AR puppeteer system that teleoperates a physical robot via a virtual counterpart, and compares gesture-only (GO) control with a multimodal voice+gesture (VG) approach. In a within-subject study (N=42), GO demonstrated higher task efficiency and pragmatic usability, while VG offered accessibility for novices but suffered from voice recognition latency and modality-switching costs. The findings indicate that multimodal AR HRI design should adapt to user expertise and task demands rather than assume universal benefits from adding modalities. The paper provides evidence-based design guidelines for expertise-aware multimodal AR robot teleoperation and discusses implications for real-world deployment and future improvements in robustness and integration.
Abstract
The integration of robotics and augmented reality (AR) offers promising opportunities to enhance human-robot interaction (HRI) by making teleoperation more transparent, spatially grounded, and intuitive. We present a head-mounted AR "puppeteer" framework in which users control a physical robot by interacting with its virtual counterpart using large language model (LLM)-driven voice commands and hand-gesture interaction on the Meta Quest 3. In a within-subject user study with 42 participants performing an AR-based robotic pick-and-place pattern-matching task, we compare two interaction conditions: gesture-only (GO) and combined voice+gesture (VG). Our results show that GO currently provides more reliable and efficient control for this time-critical task, while VG introduces additional flexibility but also latency and recognition issues that can increase workload. We further explore how prior robotics experience shapes participants' perceptions of each modality. Based on these findings, we distill a set of evidence-based design guidelines for robot teleoperation based on the AR puppeteer metaphor, framing multimodality as an adaptive strategy that must balance efficiency, robustness, and user expertise rather than assuming that additional modalities are universally beneficial. Our work contributes empirical insights into how multimodal (voice+gesture) interaction influences task efficiency, usability, and user experience in AR-based HRI.
