Multimodal "Puppeteer": Exploring Robot Teleoperation Via Virtual Counterpart with LLM-Driven Voice and Gesture Interaction in Augmented Reality

Yuchong Zhang, Bastian Orthmann, Shichen Ji, Michael Welle, Jonne Van Haastregt, Danica Kragic

TL;DR

This study introduces an AR puppeteer system that teleoperates a physical robot via a virtual counterpart, comparing gesture-only (GO) control to a multimodal voice+gesture (VG) approach. In a within-subject study (N=42), GO yielded higher task efficiency and pragmatic usability, while VG was more accessible to novices but suffered from voice recognition latency and modality-switching costs. The findings indicate that multimodal AR HRI design should adapt to user expertise and task demands rather than assume that adding modalities is universally beneficial. The paper provides evidence-based design guidelines for expertise-aware multimodal AR robot teleoperation and discusses implications for real-world deployment and future improvements in robustness and integration.

Abstract

The integration of robotics and augmented reality (AR) offers promising opportunities to enhance human-robot interaction (HRI) by making teleoperation more transparent, spatially grounded, and intuitive. We present a head-mounted AR "puppeteer" framework in which users control a physical robot by interacting with its virtual counterpart using large language model (LLM)-driven voice commands and hand-gesture interaction on the Meta Quest 3. In a within-subject user study with 42 participants performing an AR-based robotic pick-and-place pattern-matching task, we compare two interaction conditions: gesture-only (GO) and combined voice+gesture (VG). Our results show that GO currently provides more reliable and efficient control for this time-critical task, while VG introduces additional flexibility but also latency and recognition issues that can increase workload. We further explore how prior robotics experience shapes participants' perceptions of each modality. Based on these findings, we distill a set of evidence-based design guidelines for robot teleoperation built on the AR puppeteer metaphor, framing multimodality as an adaptive strategy that must balance efficiency, robustness, and user expertise, rather than assuming that additional modalities are universally beneficial. Our work contributes empirical insights into how multimodal (voice+gesture) interaction influences task efficiency, usability, and user experience in AR-based HRI.

Paper Structure

This paper contains 33 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: The AR-based multimodal robot “puppeteer” system used in this study, enabling voice and gesture interaction for teleoperation.
  • Figure 2: System realization of the AR puppeteer framework. Left: the technical implementation; right: the headset and the AR view. The user wears an AR headset that overlays a virtual robot onto the physical workspace. When the user interacts with the virtual robot, specifically by manipulating its end-effector via hand gestures, the AR device computes the corresponding virtual joint positions through inverse kinematics and sends them to a ROS node. The ROS node translates these joint positions into control commands for the physical robot using a joint-level PD controller. Simultaneously, the current joint state of the physical robot is sent back to the AR device, where forward kinematics reconstructs the real end-effector pose. This allows visualization of both the desired (virtual) and actual (physical) robot states within the AR view.
  • Figure 3: Gesture for spawning the virtual robot with the 'victory' sign.
  • Figure 4: Gesture for puppeteering the virtual robot with a three-finger pinch.
  • Figure 5: Gesture for controlling the gripper with all fingers stretched out.
  • ...and 7 more figures
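
The control loop described in the Figure 2 caption (desired joint positions tracked by a joint-level PD controller, with forward kinematics recovering the actual end-effector pose) can be illustrated with a minimal sketch. This is not the authors' implementation: the planar arm, the unit joint inertia, the gain values, and the function names `pd_torques`, `forward_kinematics`, and `simulate` are all assumptions chosen for illustration, and the ROS transport and the inverse kinematics step are omitted.

```python
import math
from typing import List, Sequence, Tuple


def pd_torques(q_des: Sequence[float], q: Sequence[float], dq: Sequence[float],
               kp: float = 100.0, kd: float = 20.0) -> List[float]:
    """Joint-level PD law: tau_i = kp * (q_des_i - q_i) - kd * dq_i.

    The gains are illustrative assumptions, not values from the paper.
    """
    return [kp * (qd - qi) - kd * dqi for qd, qi, dqi in zip(q_des, q, dq)]


def forward_kinematics(q: Sequence[float],
                       link_lengths: Sequence[float]) -> Tuple[float, float]:
    """End-effector (x, y) of a planar serial arm (a simplifying assumption)."""
    x = y = angle = 0.0
    for qi, length in zip(q, link_lengths):
        angle += qi  # joint angles accumulate along the chain
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return x, y


def simulate(q_des: Sequence[float], q0: Sequence[float], steps: int = 2000,
             dt: float = 0.001, inertia: float = 1.0) -> List[float]:
    """Track q_des from q0, modelling each joint as an independent unit inertia."""
    q = list(q0)
    dq = [0.0] * len(q)
    for _ in range(steps):
        tau = pd_torques(q_des, q, dq)
        for i in range(len(q)):
            dq[i] += tau[i] / inertia * dt  # semi-implicit Euler: velocity first
            q[i] += dq[i] * dt              # then position with the new velocity
    return q


if __name__ == "__main__":
    desired = [0.5, -0.3]           # joint targets, e.g. from the virtual robot
    actual = simulate(desired, [0.0, 0.0])
    links = [1.0, 1.0]              # hypothetical link lengths in metres
    print("desired pose:", forward_kinematics(desired, links))
    print("actual pose: ", forward_kinematics(actual, links))
```

Computing the end-effector pose for both the desired and the actual joint states mirrors the caption's point that the AR view can display the virtual target and the physical robot side by side, which makes any tracking lag visible to the operator.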