Learning to Manipulate Anything: Revealing Data Scaling Laws in Bounding-Box Guided Policies

Yihao Wu, Jinming Ma, Junbo Tan, Yanzhao Yu, Shoujie Li, Mingliang Zhou, Diyun Xiang, Xueqian Wang

TL;DR

This work tackles the limited generalization of diffusion-based robotic manipulation under semantic instructions by introducing bounding-box visual guidance. It couples a handheld Label-UMI data-collection device with a Bounding-Box Guided Diffusion Policy (BBox-DP), forming a semantic–motion decoupled framework that shifts the burden of generalization onto the object-detection module. Large-scale real-world experiments reveal a power-law relationship between generalization performance and the number of bounding-box object classes, which motivates an object-diversity–first data collection strategy that achieves ~85% success on four tasks, including on unseen objects. The approach offers a practical, scalable path toward data-efficient semantic manipulation, and the authors state that all datasets and code will be released to the community.

Abstract

Diffusion-based policies show limited generalization in semantic manipulation, posing a key obstacle to the deployment of real-world robots. This limitation arises because relying solely on text instructions is inadequate to direct the policy's attention toward the target object in complex and dynamic environments. To solve this problem, we propose leveraging bounding-box instructions to directly specify the target object, and further investigate whether data scaling laws exist in semantic manipulation tasks. Specifically, we design a handheld segmentation device with an automated annotation pipeline, Label-UMI, which enables the efficient collection of demonstration data with semantic labels. We further propose a semantic-motion-decoupled framework that integrates object detection and a bounding-box guided diffusion policy to improve generalization and adaptability in semantic manipulation. Through extensive real-world experiments on large-scale datasets, we validate the effectiveness of the approach and reveal a power-law relationship between generalization performance and the number of bounding-box objects. Finally, we summarize an effective data collection strategy for semantic manipulation, which can achieve 85% success rates across four tasks on both seen and unseen objects. All datasets and code will be released to the community.
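
To make the power-law claim concrete, the following is a minimal sketch of how such a relationship could be checked, assuming the generic form y = a * x^b, where x is the number of training objects and y is a positive error-like metric. The functional form, the function name fit_power_law, and the data are illustrative assumptions; the numbers are synthetic placeholders, not results from the paper. A power law appears as a straight line in log-log space, so an ordinary least-squares line fit recovers the exponent b.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a straight line in log-log space.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    if any(x <= 0 for x in xs) or any(y <= 0 for y in ys):
        raise ValueError("fitting in log space requires positive values")

    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)

    # Slope and intercept of the regression line log(y) = log(a) + b * log(x).
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic placeholder data generated from a known law (a = 0.8, b = -0.5),
# used only to check that the fit recovers the parameters.
num_objects = [1, 2, 4, 8, 16, 32]
error_metric = [0.8 * n ** -0.5 for n in num_objects]

a, b = fit_power_law(num_objects, error_metric)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real measurements, the points would scatter around the fitted line, and the quality of the linear fit in log-log space is what would indicate whether a power law is a reasonable description.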

Paper Structure (16 sections, 2 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 0: (a) Structure of Label-UMI. Device components: ➀ Laser, ➁ Mirror, ➂ Mini servo motor, ➃ SG90 servo motor, ➄ ESP32 microcontroller, ➅ PS2 joystick controller, ➆ Battery, ➇ U-shaped bayonet mount. (b) The Label-UMI data collection operating procedure. (c) Comparison of the time required to collect and annotate 100 data samples across different devices.
  • Figure 1: Overview of the data acquisition pipeline. (a) YOLO model acquisition process. We randomly sample a subset of the total dataset. The LaserPoint-YOLOv8s model is applied to the first video frame to locate the laser point, which serves as a prompt. This prompt, together with the full video, is fed into SAM2 to segment the object and generate a bounding box for each video frame. A YOLO model is then trained to detect the object using the text labels and the images annotated with bounding boxes. (b) Full dataset (w/ BBox) acquisition process. We leverage the UMI data pipeline to extract trajectory and image information. The trained YOLO model is then applied to automatically annotate images with bounding boxes, resulting in a full dataset with bounding-box labels.
  • Figure 2: Overview of the BBox-DP. (a) Semantic detection part. The raw image and the image overlaid with bounding boxes (generated by YOLO) are encoded separately using a ViT. (b) Main policy part. The resulting visual features are then combined with the robot’s proprioceptive state to form a unified conditioning signal. This multimodal condition guides a U-Net-based diffusion model to iteratively denoise the action sequence, ultimately yielding the refined, denoised action output.
  • Figure 3: Real-robot experiments on four semantic manipulation tasks: (a) Rubbish Disposal – the robot identifies a specified piece of rubbish, grasps it, and discards it into a trash bin; (b) Button Pressing – the robot selects and presses a designated button among multiple visually similar distractors; (c) Water Pouring – the robot grasps a target container and pours its contents accurately into a cup; (d) Drink Fetching – the robot locates a prompted drink from a multi-layer shelf, grasps it, and delivers it to a human.
  • Figure 4: The performance score for each operational object. In each radar chart, objects to the left of the dashed line represent the test set, while those to the right belong to the training set. Since beverage brands and custom-designed buttons are difficult to differentiate textually, we represent them using codes; the full mapping between codes and objects is provided in the supplementary video due to space constraints.
  • ...and 2 more figures
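
The step in Figure 1(a) that turns a per-frame segmentation mask into a bounding box for detector training can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the mask is taken to be a plain 2D list of 0/1 values (in practice it would come from SAM2), the helper names mask_to_bbox and bbox_to_yolo_line are hypothetical, and the output follows the common YOLO text-label convention of a class id followed by the normalized box center, width, and height.

```python
def mask_to_bbox(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around nonzero pixels.

    Coordinates are inclusive pixel indices. Returns None for an empty mask.
    Hypothetical helper for illustration.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for row in mask for c, value in enumerate(row) if value]
    return min(cols), rows[0], max(cols), rows[-1]


def bbox_to_yolo_line(class_id, bbox, image_width, image_height):
    """Format a pixel box as 'class cx cy w h' with values normalized to [0, 1]."""
    x_min, y_min, x_max, y_max = bbox
    box_w = x_max - x_min + 1  # inclusive pixel indices, hence the +1
    box_h = y_max - y_min + 1
    cx = (x_min + box_w / 2) / image_width
    cy = (y_min + box_h / 2) / image_height
    return (
        f"{class_id} {cx:.6f} {cy:.6f} "
        f"{box_w / image_width:.6f} {box_h / image_height:.6f}"
    )


# Toy 4x6 mask standing in for one segmented video frame.
toy_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

bbox = mask_to_bbox(toy_mask)
label_line = bbox_to_yolo_line(0, bbox, image_width=6, image_height=4)
print(bbox)        # (1, 1, 3, 2)
print(label_line)  # 0 0.416667 0.500000 0.500000 0.500000
```

Frames where the object is fully occluded would yield an empty mask, so a caller would need to decide whether to skip them or write an empty label file.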