ScanBot: Towards Intelligent Surface Scanning in Embodied Robotic Systems
Zhiling Chen, Yang Zhang, Fardin Jalil Piran, Qianyu Zhou, Jiong Tang, Farhad Imani
TL;DR
ScanBot targets a gap in robot learning datasets: industrial laser scanning demands sub-millimeter trajectory stability and precise parameter control, all conditioned on natural language instructions. The authors build a first-of-its-kind instruction-conditioned, multimodal dataset covering 12 objects and 6 task types, with synchronized RGB-D images, laser profiles, and robot states, enabling end-to-end evaluation of perception, planning, and execution. Benchmarking GPT-4.1, OpenAI o3, Gemini 2.5 Pro, and Gemini 2.5 Flash reveals substantial limitations in parameter tuning, region grounding, and path planning for high-precision scanning, which suggests a need for tool-aware perception and closed-loop control. The work has practical implications for industrial inline inspection and lays a foundation for future multi-tool, adaptive scanning systems.
Abstract
We introduce ScanBot, a novel dataset designed for instruction-conditioned, high-precision surface scanning in robotic systems. In contrast to existing robot learning datasets that focus on coarse tasks such as grasping, navigation, or dialogue, ScanBot targets the high-precision demands of industrial laser scanning, where sub-millimeter path continuity and parameter stability are critical. The dataset covers laser scanning trajectories executed by a robot across 12 diverse objects and 6 task types: full-surface scans, geometry-focused regions, spatially referenced parts, functionally relevant structures, defect inspection, and comparative analysis. Each scan is guided by natural language instructions and paired with synchronized RGB, depth, and laser profiles, as well as robot pose and joint states. Despite recent progress, existing vision-language-action (VLA) models still fail to generate stable scanning trajectories under fine-grained instructions and real-world precision demands. To investigate this limitation, we benchmark a range of multimodal large language models (MLLMs) across the full perception-planning-execution loop, revealing persistent challenges in instruction-following under realistic constraints.
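To make the dataset description above more concrete, the following is a minimal sketch of how one instruction-conditioned scan might be represented in memory. It is an illustration only, not the released ScanBot format: the class names (`TaskType`, `ScanFrame`, `ScanEpisode`), the field names, the use of nested Python lists, and the `is_time_ordered` helper are all assumptions. Only the six task types and the list of modalities come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TaskType(Enum):
    """The six task types named in the abstract (member names are hypothetical)."""
    FULL_SURFACE = "full-surface scan"
    GEOMETRY_REGION = "geometry-focused region"
    SPATIAL_PART = "spatially referenced part"
    FUNCTIONAL_STRUCTURE = "functionally relevant structure"
    DEFECT_INSPECTION = "defect inspection"
    COMPARATIVE_ANALYSIS = "comparative analysis"


@dataclass
class ScanFrame:
    """One synchronized sample; field layout is an assumption."""
    timestamp: float                # seconds since the start of the scan (assumed unit)
    rgb: List[List[List[int]]]      # H x W x 3 color image
    depth: List[List[float]]        # H x W depth map
    laser_profile: List[float]      # one line of heights from the laser scanner
    robot_pose: List[float]         # end-effector pose; parameterization assumed
    joint_states: List[float]       # one value per robot joint


@dataclass
class ScanEpisode:
    """A full scan paired with its natural language instruction."""
    instruction: str
    object_id: str
    task_type: TaskType
    frames: List[ScanFrame] = field(default_factory=list)

    def is_time_ordered(self) -> bool:
        """Return True if frame timestamps never decrease."""
        stamps = [f.timestamp for f in self.frames]
        return all(a <= b for a, b in zip(stamps, stamps[1:]))


def make_dummy_frame(t: float) -> ScanFrame:
    """Build a tiny placeholder frame (all values are dummy data)."""
    return ScanFrame(
        timestamp=t,
        rgb=[[[0, 0, 0]]],
        depth=[[0.0]],
        laser_profile=[0.0, 0.0],
        robot_pose=[0.0] * 6,
        joint_states=[0.0] * 6,
    )


episode = ScanEpisode(
    instruction="Scan the full top surface of the part.",  # made-up example instruction
    object_id="object_01",                                 # made-up identifier
    task_type=TaskType.FULL_SURFACE,
    frames=[make_dummy_frame(0.0), make_dummy_frame(0.1)],
)
```

Grouping all modalities in a single per-timestamp record is one simple way to express the synchronization the abstract describes. A real loader would more likely store images and profiles as arrays on disk and index them by timestamp.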
