Robotic Scene Cloning: Advancing Zero-Shot Robotic Scene Adaptation in Manipulation via Visual Prompt Editing

Binyuan Huang; Yuqing Wen; Yucheng Zhao; Yaosi Hu; Tiancai Wang; Chang Wen Chen; Haoqiang Fan; Zhenzhong Chen

TL;DR

Robotic Scene Cloning (RSC) is a novel method for scene-specific adaptation that edits existing robot operation trajectories. It achieves accurate, scene-consistent sample generation through a visual prompting mechanism and a carefully tuned condition injection module.

Abstract

Modern robots can perform a wide range of simple tasks and adapt to diverse scenarios within the environments they were trained in. However, deploying pre-trained robot models in real-world user scenarios remains challenging because of their limited zero-shot capabilities, which often necessitates extensive on-site data collection. To address this issue, we propose Robotic Scene Cloning (RSC), a novel method designed for scene-specific adaptation by editing existing robot operation trajectories. RSC achieves accurate and scene-consistent sample generation by leveraging a visual prompting mechanism and a carefully tuned condition injection module. RSC not only transfers textures but also performs moderate shape adaptations in response to the visual prompts, and it demonstrates reliable task performance across a variety of object types. Experiments across various simulated and real-world environments demonstrate that RSC significantly enhances policy generalization in target environments.

Paper Structure (18 sections, 9 equations, 6 figures, 2 tables)

Introduction
Related Work
Language-conditioned Embodied Policies
Generative Data Augmentation in Embodied Intelligence
Image translation for robotics
Method
Pipeline of Robotic Scene Cloning
Robotic Condition Generator
Visual Prompt Editor
Progressive Masked Fusion
Inversion
Denoising
Visual-Prompt Guided Image Editing
Implementation Details
Experiment Results on Realistic Robotic Benchmark (SIMPLER)
...and 3 more sections

Figures (6)

Figure 1: Existing robotic policies face challenges when migrating from training to deployment environments, particularly in handling novel products. (a) Recollect Data: Collecting deployment-specific data enables fine-tuning for high accuracy but is labor-intensive and moderately data-efficient. (b) Existing Embodied Augmentation Methods: Augmentation using text prompts reduces labor but achieves limited accuracy and low data efficiency. (c) Robotic Scene Cloning: Cloning scenes with visual cues from the deployment environment achieves high accuracy with better data efficiency and lower labor cost. The comparison highlights the trade-offs in accuracy, labor intensity, and data efficiency for each method. The two rightmost bar charts show the success rates of grasping Monster Energy Drink (V1), Monster Energy Drink (V2), and Disinfectant Bottle in the SIMPLER environment under distractor conditions.
Figure 2: Overview of Robotic Scene Cloning. (a) The RSC pipeline follows a two-stage process. The Robotic Condition Generator prepares scene-specific conditions from training trajectories and a new product. The Visual Prompt Editor then generates visually cloned trajectories, which are used to fine-tune robotic models for better adaptation to novel products. (b) The Robotic Condition Generator prepares three conditions: First, the Grounding Resampler combines the new product's visual, textual, and positional encodings to generate position-bound visual conditions. Second, Grounding-DINO + SAM2 extract the coordinates and masks of existing products. Third, DepthAnythingV2 and ControlNet capture the pose conditions. (c) The Visual Prompt Editor applies the three conditions (visual, pose, and layout) generated by the Robotic Condition Generator. Visual and pose conditions guide the DDIM processes, while the layout condition is used in the progressive masked fusion.
Figure 3: Editing effect comparison between RSC and GreenAug in the SIMPLER Benchmark.
Figure 4: The visualization results of our real-world validation task on the WidowX robot.
Figure 5: Each pair of rows shows the original video frames (Raw) and the corresponding frames edited by our method (RSC) after replacing the target object.
...and 1 more figure
