
A Multi-View 3D Telepresence System for XR Robot Teleoperation

Enes Ulas Dincer, Manuel Zaremski, Alexandra Nick, Elias Wucher, Barbara Deml, Gerhard Neumann

Abstract

Robot teleoperation is critical for applications such as remote maintenance, fleet robotics, search and rescue, and data collection for robot learning. Effective teleoperation requires intuitive 3D visualization with reliable depth cues, which conventional screen-based interfaces often fail to provide. We introduce a multi-view VR telepresence system that (1) fuses geometry from three cameras to produce GPU-accelerated point-cloud rendering on standalone VR hardware, and (2) integrates a wrist-mounted RGB stream to provide high-resolution local detail where point-cloud accuracy is limited. Our pipeline supports real-time rendering of approximately 75k points on the Meta Quest 3. A within-subject study with 31 participants compared our system to other visualization modalities: RGB streams, a stereo-vision projection displayed directly in the VR headset (OpenTeleVision), and point clouds without additional RGB information. Across three teleoperated manipulation tasks, we measured task success, completion time, perceived workload, and usability. Our system achieved the best overall performance, and the point-cloud modality without RGB also outperformed the RGB streams and OpenTeleVision. These results show that combining global 3D structure with localized high-resolution detail substantially improves telepresence for manipulation and provides a strong foundation for next-generation robot teleoperation systems.
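The abstract summarizes the core pipeline: RGB-D geometry from three static cameras is fused into a single point cloud and reduced to roughly 75k points for real-time rendering on the Meta Quest 3. The sketch below illustrates one plausible way to fuse and budget the cloud with Open3D; it is not the authors' implementation, and the file names, camera intrinsics, extrinsics, and the 4 mm starting voxel size are placeholder assumptions.

```python
# Minimal sketch (not the paper's code) of multi-camera point-cloud
# fusion with a fixed point budget, using Open3D.
import numpy as np
import open3d as o3d

POINT_BUDGET = 75_000  # approximate rendering budget reported for the Quest 3

def make_cloud(color_path, depth_path, intrinsic, extrinsic):
    """Back-project one RGB-D frame into the shared world frame."""
    color = o3d.io.read_image(color_path)
    depth = o3d.io.read_image(depth_path)
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1000.0, depth_trunc=2.0,
        convert_rgb_to_intensity=False)
    # extrinsic maps world -> camera, as Open3D expects
    return o3d.geometry.PointCloud.create_from_rgbd_image(
        rgbd, intrinsic, extrinsic)

# Placeholder intrinsics/extrinsics; real values come from calibration.
intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, 600.0, 600.0, 320.0, 240.0)
extrinsics = {"left": np.eye(4), "right": np.eye(4), "upper": np.eye(4)}

fused = o3d.geometry.PointCloud()
for name, T in extrinsics.items():
    fused += make_cloud(f"{name}_color.png", f"{name}_depth.png", intrinsic, T)

# Grow the voxel size until the fused cloud fits the rendering budget.
voxel = 0.004  # assumed 4 mm starting resolution
cloud = fused.voxel_down_sample(voxel)
while len(cloud.points) > POINT_BUDGET:
    voxel *= 1.25
    cloud = fused.voxel_down_sample(voxel)
print(f"{len(cloud.points)} points at voxel size {voxel:.4f} m")
```

Voxel downsampling keeps point density roughly uniform across the workspace, which suits GPU point rendering on a standalone headset better than random subsampling would.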


Paper Structure

This paper contains 22 sections and 8 figures.

Figures (8)

  • Figure C1: Leader–follower teleoperation system. One Panda arm acts as the leader, physically guided in gravity-compensation mode, while the second acts as the follower in the manipulation workspace. The workspace is observed by three static RGB-D cameras (left, right, upper) and a wrist-mounted RGB-D camera. The participant wears a Meta Quest 3 headset and experiences one of the four visualization conditions (RGBs, PC, PC+RGB, OT).
  • Figure C2: Visualization modalities evaluated in our VR teleoperation study, illustrated here for the cup-insertion task. From left to right: (a) RGBs: four virtual screens showing left, right, top, and wrist RGB streams; (b) PC: semantically filtered fused point cloud rendered together with the leader arm mesh; (c) PC+RGB: the same fused point cloud augmented with a wrist-mounted RGB view rigidly attached to the end-effector. The magenta inset shows a zoomed crop of this wrist RGB stream, highlighting the available high-resolution local detail; (d) OT: ego-centric stereo view from the OpenTeleVision setup.
  • Figure D1: An overview of the study procedure, highlighting the elements of data collection and randomization of both the displayed visualization modalities and the manipulation tasks.
  • Figure D2: The three manipulation tasks used in the study, shown as start (top), intermediate (middle), and goal (bottom) states: (a) two sequential cup insertions with occluding obstacles, (b) T-shape stacking/assembly, and (c) wire-loop placement around an L-shaped stand.
  • Figure E1: Boxplots of NASA-TLX ratings across the four visualization conditions (RGBs, PC, PC+RGB, OT). Higher values indicate higher subjective workload. Each subplot represents one workload dimension. Asterisks indicate the level of statistical significance (* p < .05, ** p < .01, *** p < .001); see the analysis sketch after this list.
  • ...and 3 more figures
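Figure E1 annotates pairwise differences between the four conditions with significance asterisks. The paper's exact statistical procedure is not stated in this excerpt; the sketch below assumes Wilcoxon signed-rank tests over the 31 within-subject ratings with a Holm correction, a common choice for NASA-TLX data, and runs on randomly generated placeholder ratings.

```python
# Hedged sketch of pairwise within-subject comparisons like those in
# Figure E1; the test choice and the data below are assumptions.
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

def stars(p):
    """Map a p-value to the asterisk convention used in Figure E1."""
    return "***" if p < .001 else "**" if p < .01 else "*" if p < .05 else "ns"

rng = np.random.default_rng(0)
# Placeholder ratings: 31 participants x 4 conditions (0-100 TLX scale).
ratings = {c: rng.uniform(0, 100, 31) for c in ["RGBs", "PC", "PC+RGB", "OT"]}

pairs = list(combinations(ratings, 2))
pvals = [wilcoxon(ratings[a], ratings[b]).pvalue for a, b in pairs]

# Holm step-down correction, written out to avoid extra dependencies.
order = np.argsort(pvals)
m = len(pvals)
adjusted = np.empty(m)
running_max = 0.0
for rank, idx in enumerate(order):
    running_max = max(running_max, (m - rank) * pvals[idx])
    adjusted[idx] = min(1.0, running_max)

for (a, b), p in zip(pairs, adjusted):
    print(f"{a} vs {b}: p_holm = {p:.4f} ({stars(p)})")
```

With real data, the placeholder arrays would be replaced by one subscale's ratings per participant, repeated for each NASA-TLX dimension shown in the figure.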