Zero-Splat TeleAssist: A Zero-Shot Pose Estimation Framework for Semantic Teleoperation

Srijan Dokania, Dharini Raghavan

TL;DR

This work tackles teleoperation under limited onboard sensing by fusing external CCTV imagery with a zero-shot vision-language pipeline to localize robots without markers. It introduces an end-to-end Zero-SPLAT framework that computes 6-DoF robot poses from monocular video using MC-CLIPSeg for segmentation, MiDaS for depth, and weighted PCA for pose extraction, then integrates these poses into a 3D Gaussian Splatting map that serves as a global shared frame. The system supports semantic navigation and semi-autonomous planning, runs in real time on low-power hardware, and enables AR overlays and natural-language interaction to reduce operator workload. Experimental results across different robots and scenes demonstrate robust re-localization, improved task efficiency, and decreased cognitive load compared to segmentation-only baselines, with effective handling of robot kidnapping scenarios.

Abstract

We introduce Zero-Splat TeleAssist, a zero-shot sensor-fusion pipeline that transforms commodity CCTV streams into a shared, 6-DoF world model for multilateral teleoperation. By integrating vision-language segmentation, monocular depth, weighted-PCA pose extraction, and 3D Gaussian Splatting (3DGS), TeleAssist provides every operator with real-time global positions and orientations of multiple robots without fiducials or depth sensors in an interaction-centric teleoperation setup.
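
The weighted-PCA pose extraction step named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `backproject` and `weighted_pca_pose`, the pinhole intrinsics `fx, fy, cx, cy`, the use of per-pixel segmentation confidence as weights, and the assumption that depth has already been scaled to metric units (MiDaS itself predicts relative depth) are all hypothetical choices made for illustration.

```python
import numpy as np


def backproject(mask, depth, fx, fy, cx, cy):
    """Lift masked pixels to 3D camera-frame points with a pinhole model.

    Assumes `depth` holds metric depth per pixel (an assumption; a relative
    depth map would first need a scale/shift alignment).
    """
    v, u = np.nonzero(mask)              # row (v) and column (u) indices
    z = depth[v, u].astype(float)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)   # shape (N, 3)


def weighted_pca_pose(points, weights):
    """Estimate a position and orientation from weighted 3D points.

    Returns the weighted centroid and a 3x3 rotation whose columns are the
    principal axes, ordered by decreasing weighted variance. The sign of
    each axis is inherently ambiguous in PCA; only handedness is fixed here.
    """
    pts = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize weights to sum to 1
    centroid = w @ pts                           # weighted mean, shape (3,)
    centered = pts - centroid
    cov = (centered * w[:, None]).T @ centered   # weighted covariance, 3x3
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    axes = eigvecs[:, np.argsort(eigvals)[::-1]] # reorder to descending variance
    if np.linalg.det(axes) < 0:                  # enforce a right-handed frame
        axes[:, 2] = -axes[:, 2]
    return centroid, axes


if __name__ == "__main__":
    # Synthetic blob elongated along x, standing in for a segmented robot.
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(5000, 3)) * [2.0, 0.5, 0.1] + [1.0, 2.0, 3.0]
    position, rotation = weighted_pca_pose(pts, np.ones(len(pts)))
    print("position:", position)
    print("rotation:\n", rotation)
```

In this sketch the centroid supplies the three translational degrees of freedom and the principal axes supply the three rotational ones, which is one plausible reading of how a 6-DoF pose could be derived from a segmented, depth-lifted point set.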

Paper Structure

This paper contains 5 sections, 2 figures.

Figures (2)

  • Figure 1: Proposed End-to-End Zero-SPLAT Framework - Stage 0: Camera Placement, Stage 1: Zero-Shot Segmentation, Stage 2: Pose Uncertainty Estimation, Stage 3: 3DGS Integration, Stages 4 & 5: Semi-Autonomous Planning & Navigation
  • Figure 2: Zero-Splat Results