ScanBot: Towards Intelligent Surface Scanning in Embodied Robotic Systems

Zhiling Chen, Yang Zhang, Fardin Jalil Piran, Qianyu Zhou, Jiong Tang, Farhad Imani

TL;DR

ScanBot targets a gap in embodied AI: industrial laser scanning demands sub-millimeter trajectory stability and precise parameter control, both conditioned on natural language instructions. The authors build a first-of-its-kind instruction-conditioned, multimodal dataset covering 12 objects and 6 task types, with synchronized RGB-D, laser profiles, and robot states, which enables end-to-end evaluation of perception, planning, and execution. Benchmarking GPT-4.1, OpenAI o3, Gemini 2.5 Pro, and Gemini 2.5 Flash reveals substantial limitations in parameter tuning, region grounding, and path planning for high-precision scanning, suggesting a need for tool-aware perception and closed-loop control. The work has practical implications for industrial inline inspection and lays a foundation for future multi-tool, adaptive scanning systems.

Abstract

We introduce ScanBot, a novel dataset designed for instruction-conditioned, high-precision surface scanning in robotic systems. In contrast to existing robot learning datasets that focus on coarse tasks such as grasping, navigation, or dialogue, ScanBot targets the high-precision demands of industrial laser scanning, where sub-millimeter path continuity and parameter stability are critical. The dataset covers laser scanning trajectories executed by a robot across 12 diverse objects and 6 task types: full-surface scans, geometry-focused regions, spatially referenced parts, functionally relevant structures, defect inspection, and comparative analysis. Each scan is guided by natural language instructions and paired with synchronized RGB, depth, and laser profiles, as well as robot pose and joint states. Despite recent progress, existing vision-language-action (VLA) models still fail to generate stable scanning trajectories under fine-grained instructions and real-world precision demands. To investigate this limitation, we benchmark a range of multimodal large language models (MLLMs) across the full perception-planning-execution loop, revealing persistent challenges in instruction following under realistic constraints.
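
To make the pairing of instructions with synchronized sensor streams concrete, here is a minimal Python sketch of how one scan episode might be represented. It is not the dataset's actual schema: every class, field name, unit, and file path below is a hypothetical choice for illustration, and only the six task types and the list of modalities come from the abstract. The `max_step_mm` helper is an assumed, simple proxy for path continuity (the largest tool displacement between consecutive frames), not a metric defined by the authors.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class TaskType(Enum):
    """The six task types named in the abstract."""
    FULL_SURFACE = "full-surface scan"
    GEOMETRY_FOCUSED = "geometry-focused region"
    SPATIALLY_REFERENCED = "spatially referenced part"
    FUNCTIONAL_STRUCTURE = "functionally relevant structure"
    DEFECT_INSPECTION = "defect inspection"
    COMPARATIVE_ANALYSIS = "comparative analysis"


@dataclass
class Frame:
    """One time step of synchronized streams (hypothetical field names)."""
    timestamp_s: float
    rgb_path: str                      # assumed: image stored on disk
    depth_path: str                    # assumed: depth map stored on disk
    laser_profile_mm: List[float]      # assumed: heights along the laser line
    # Assumed layout: x, y, z in meters, then a 3-value orientation.
    tcp_pose: Tuple[float, float, float, float, float, float]
    joint_positions_rad: Tuple[float, ...]


@dataclass
class ScanEpisode:
    """A natural language instruction paired with its recorded frames."""
    object_id: str
    task_type: TaskType
    instruction: str
    frames: List[Frame]

    def is_time_ordered(self) -> bool:
        """True if timestamps strictly increase from frame to frame."""
        pairs = zip(self.frames, self.frames[1:])
        return all(a.timestamp_s < b.timestamp_s for a, b in pairs)

    def max_step_mm(self) -> float:
        """Largest tool displacement between consecutive frames, in mm."""
        steps = [
            math.dist(a.tcp_pose[:3], b.tcp_pose[:3]) * 1000.0
            for a, b in zip(self.frames, self.frames[1:])
        ]
        return max(steps, default=0.0)


def make_demo_episode() -> ScanEpisode:
    """Build a tiny synthetic episode; all values are made up."""
    frames = [
        Frame(
            timestamp_s=0.1 * i,
            rgb_path=f"rgb/{i:04d}.png",
            depth_path=f"depth/{i:04d}.png",
            laser_profile_mm=[0.0, 0.1, 0.0],
            tcp_pose=(0.0005 * i, 0.2, 0.3, 0.0, 3.14, 0.0),
            joint_positions_rad=(0.0,) * 6,
        )
        for i in range(3)
    ]
    return ScanEpisode(
        object_id="demo_object",
        task_type=TaskType.FULL_SURFACE,
        instruction="Scan the entire top surface of the part.",
        frames=frames,
    )


if __name__ == "__main__":
    episode = make_demo_episode()
    print(episode.task_type.value, episode.is_time_ordered())
    print(f"max step: {episode.max_step_mm():.3f} mm")
```

Keeping the instruction and task type at the episode level and the sensor readings at the frame level mirrors the abstract's description of one instruction guiding a whole scan. A check such as `max_step_mm` is one simple way to flag trajectories whose step size exceeds a chosen sub-millimeter budget.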

Paper Structure

This paper contains 21 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Overview and motivation behind the ScanBot dataset. (a) Embodied AI must generalize not only across tasks and environments, but also across tools, each with distinct control and perception demands. (b) Gripper tasks involve discrete object interaction, while scanner tasks require precise region localization and smooth, continuous motion. (c) Traditional laser scanning follows fixed, task-agnostic paths, leading to inefficient coverage and wasted time on irrelevant areas. (d) ScanBot includes 6 real-world components and 6 3D-printed shapes, enabling 6 task types and 4 evaluation capabilities for instruction-conditioned surface scanning.
  • Figure 2: Hardware setup of the ScanBot system. A UR3 robotic arm is equipped with a Keyence LJ-X8200 laser profiler and an Intel RealSense D435i RGB-D camera mounted on the end-effector. A GoPro HERO8 captures third-person views from a fixed tripod. The entire setup operates within a black-curtained environment to ensure consistent and interference-free measurements.
  • Figure 3: Overview of the 12 scanned objects in the ScanBot dataset. The top two rows show six real-world electronic components (four GPU boards, one RAM module, and one WiFi card) alongside their corresponding point clouds. The bottom two rows present six 3D-printed parts grouped into three comparison sets: (1) black and white triangles with no surface features, (2) two cubes with identical shape and color but different embossed patterns, and (3) two cylinders with identical features but different colors.
  • Figure 4: Multiview examples and annotated features in the ScanBot dataset. The first column shows first-person views captured by the Intel RealSense D435i mounted on the robot’s end-effector. The second column presents third-person overviews recorded by a fixed GoPro camera. The third column highlights annotated object features.
  • Figure 5: Distribution of scanning tasks across six instruction types in the ScanBot dataset. The pie chart shows the number of tasks belonging to each task type, with values labeled inside each segment.
  • ...and 6 more figures