Zero-Splat TeleAssist: A Zero-Shot Pose Estimation Framework for Semantic Teleoperation
Srijan Dokania, Dharini Raghavan
TL;DR
This work tackles teleoperation under limited onboard sensing by fusing external CCTV imagery with a zero-shot vision-language pipeline that localizes robots without markers. It introduces the end-to-end Zero-Splat TeleAssist framework, which estimates 6-DoF robot poses from monocular video using MC-CLIPSeg for segmentation, MiDaS for depth, and weighted PCA for pose extraction, and then registers these poses in a 3D Gaussian Splatting map that serves as a global shared frame. The system supports semantic navigation and semi-autonomous planning, runs in real time on low-power hardware, and enables AR overlays and natural-language interaction to reduce operator workload. Experiments across different robots and scenes demonstrate robust re-localization, improved task efficiency, and lower cognitive load than segmentation-only baselines, as well as effective handling of robot-kidnapping scenarios.
Abstract
We introduce Zero-Splat TeleAssist, a zero-shot sensor-fusion pipeline that transforms commodity CCTV streams into a shared, 6-DoF world model for multilateral teleoperation. By integrating vision-language segmentation, monocular depth, weighted-PCA pose extraction, and 3D Gaussian Splatting (3DGS), TeleAssist provides every operator with the real-time global positions and orientations of multiple robots, without fiducials or depth sensors, in an interaction-centric teleoperation setup.
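The weighted-PCA pose extraction step can be illustrated with a minimal sketch. The abstract does not specify how the weights are chosen or how pixels are lifted to 3D, so the code below assumes the robot's segmented pixels have already been back-projected to 3D points using the monocular depth map, and that each point carries a non-negative weight (for example, a segmentation confidence). The function name `weighted_pca_pose` and the 4x4 output convention are hypothetical, not taken from the paper.

```python
import numpy as np

def weighted_pca_pose(points, weights):
    """Estimate a 6-DoF pose from weighted 3D points (illustrative sketch).

    points  : (N, 3) array of 3D points on the robot (assumed to come from
              a segmentation mask back-projected with estimated depth).
    weights : (N,) array of non-negative per-point weights (assumed to be
              segmentation confidences or similar).
    Returns a 4x4 homogeneous transform: the rotation columns are the
    principal axes, and the translation is the weighted centroid.
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    w = weights / weights.sum()            # normalize weights to sum to 1

    centroid = w @ points                  # weighted mean position
    centered = points - centroid
    cov = (centered * w[:, None]).T @ centered   # weighted 3x3 covariance

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]      # largest variance first
    rotation = eigvecs[:, order]

    # Flip the last axis if needed so the axes form a right-handed frame.
    if np.linalg.det(rotation) < 0:
        rotation[:, 2] *= -1

    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = centroid
    return pose
```

Note that PCA axes are defined only up to sign, so a practical system would need an extra cue (such as motion direction or a known robot shape) to resolve which way the robot faces; that step is outside this sketch.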
