Zero-Splat TeleAssist: A Zero-Shot Pose Estimation Framework for Semantic Teleoperation

Srijan Dokania; Dharini Raghavan

TL;DR

This work tackles teleoperation under limited onboard sensing by fusing external CCTV imagery with a zero-shot vision-language pipeline to localize robots without markers. It introduces the end-to-end Zero-Splat TeleAssist framework, which computes 6-DoF robot poses from monocular video using MC-CLIPSeg for segmentation, MiDaS for depth, and weighted PCA for pose extraction, then registers these poses in a 3D Gaussian Splatting map that serves as a shared global frame. The system supports semantic navigation and semi-autonomous planning, runs in real time on low-power hardware, and enables AR overlays and natural-language interaction to reduce operator workload. Experiments across different robots and scenes show robust re-localization, including in robot-kidnapping scenarios, along with improved task efficiency and lower cognitive load than segmentation-only baselines.

Abstract

We introduce Zero-Splat TeleAssist, a zero-shot sensor-fusion pipeline that transforms commodity CCTV streams into a shared, 6-DoF world model for multilateral teleoperation. By integrating vision-language segmentation, monocular depth, weighted-PCA pose extraction, and 3D Gaussian Splatting (3DGS), TeleAssist gives every operator in an interaction-centric teleoperation setup the real-time global positions and orientations of multiple robots, without fiducials or depth sensors.
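
The weighted-PCA pose extraction step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a depth map already converted to metric scale (raw MiDaS output is relative), and per-pixel weights such as segmentation confidence. The function names `backproject` and `weighted_pca_pose` are hypothetical.

```python
import numpy as np

def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels to 3D camera-frame points (assumed pinhole model)."""
    v, u = np.nonzero(mask)          # pixel rows (v) and columns (u) inside the mask
    z = depth[v, u]                  # assumed metric depth at those pixels
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def weighted_pca_pose(points, weights):
    """Estimate a 6-DoF pose as (weighted centroid, principal-axis rotation).

    points:  (N, 3) array of 3D points on the robot
    weights: (N,) non-negative weights, e.g. segmentation confidence (assumption)
    Returns the centroid and a 3x3 rotation whose columns are the principal
    axes, ordered from largest to smallest variance.
    """
    w = weights / weights.sum()
    centroid = w @ points
    centered = points - centroid
    cov = (centered * w[:, None]).T @ centered   # weighted covariance
    _, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    R = eigvecs[:, ::-1].copy()                  # major axis first
    if np.linalg.det(R) < 0:                     # enforce a right-handed frame
        R[:, 2] *= -1
    return centroid, R
```

Note that PCA fixes each axis only up to sign, so a real system would need an extra cue (for example, motion direction across frames) to resolve which way the robot is facing.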
