RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer

Liu Liu, Xiaofeng Wang, Guosheng Zhao, Keyu Li, Wenkang Qin, Jiaxiong Qiu, Zheng Zhu, Guan Huang, Zhizhong Su

TL;DR

RoboTransfer addresses the sim-to-real gap in robotic visuomotor learning with a diffusion-based video synthesis framework that enforces geometry consistency across multiple views and offers explicit control over background and object appearance. It combines depth and normal geometry conditioning with background and object appearance signals, using a multi-view encoding strategy and a dedicated dataset pipeline to create geometry-appearance triplets from real data. The approach improves multi-view consistency and appearance fidelity, and on real-robot tasks, policies trained with RoboTransfer data achieve a 33.3% relative improvement in success rate in the DIFF-OBJ setting and a 251% relative improvement in the more challenging DIFF-ALL setting. Real-world experiments validate the data augmentation benefits, highlighting RoboTransfer as a practical tool for enhancing robotic policy generalization through controllable synthetic data.

Abstract

Imitation Learning has become a fundamental approach in robotic manipulation. However, collecting large-scale real-world robot demonstrations is prohibitively expensive. Simulators offer a cost-effective alternative, but the sim-to-real gap makes it extremely challenging to scale. Therefore, we introduce RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike previous methods, RoboTransfer integrates multi-view geometry with explicit control over scene components, such as background and object attributes. By incorporating cross-view feature interactions and global depth/normal conditions, RoboTransfer ensures geometry consistency across views. This framework allows fine-grained control, including background edits and object swaps. Experiments demonstrate that RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity. In addition, policies trained on the data generated by RoboTransfer achieve a 33.3% relative improvement in the success rate in the DIFF-OBJ setting and a substantial 251% relative improvement in the more challenging DIFF-ALL scenario. Explore more demos on our project page: https://horizonrobotics.github.io/robot_lab/robotransfer

Paper Structure

This paper contains 25 sections, 7 equations, 15 figures, 3 tables, 2 algorithms.

Figures (15)

  • Figure 1: Overview of RoboTransfer.
  • Figure 2: The RoboTransfer framework performs multi-view consistent modeling to jointly reason across viewpoints. It represents geometry with scaled depth and normal maps, and encodes appearance using reference backgrounds and object-specific images for detailed control over appearance.
  • Figure 3: RoboTransfer data processing pipeline. The data processing pipeline consists of two main components: Geometry conditions (left) are derived by using Mono-Normal and Video Depth Anything, which are scale-aligned with sensor depth to ensure consistency. Appearance conditions (right) are obtained by sampling keyframes from the video. GPT-4 is used to generate object descriptions, which are then processed by Grounding-SAM to create per-object masks. Additionally, background inpainting is used to generate complete reference backgrounds.
  • Figure 4: Visualizations of RoboTransfer with different background reference images.
  • Figure 5: Visualizations of RoboTransfer with different object reference images.
  • ...and 10 more figures
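
The Figure 3 caption mentions that predicted depth is scale-aligned with sensor depth. The summary does not say how that alignment is computed, so the following is only a minimal sketch under an assumption: a single least-squares scale factor fitted on pixels where the sensor reports valid depth. The function names `fit_depth_scale` and `align_depth` and the `min_valid` threshold are hypothetical, and depth maps are passed as flattened lists to keep the example dependency-free.

```python
from typing import List, Optional, Sequence


def fit_depth_scale(
    predicted: Sequence[float],
    sensor: Sequence[float],
    min_valid: float = 1e-6,
) -> Optional[float]:
    """Return the scale s minimizing sum((s * predicted - sensor) ** 2).

    Assumption (not from the paper): pixels where either depth value is
    <= min_valid are treated as holes and skipped. Returns None when no
    usable pixel pair exists.
    """
    if len(predicted) != len(sensor):
        raise ValueError("depth maps must have the same number of pixels")
    num = 0.0  # accumulates sum(p * d) over valid pixels
    den = 0.0  # accumulates sum(p * p) over valid pixels
    for p, d in zip(predicted, sensor):
        if p > min_valid and d > min_valid:
            num += p * d
            den += p * p
    if den == 0.0:
        return None
    return num / den  # closed-form least-squares solution


def align_depth(predicted: Sequence[float], scale: float) -> List[float]:
    """Apply the fitted scale to every predicted depth value."""
    return [scale * p for p in predicted]


# Toy data: relative depth from a monocular estimator and metric depth
# from a sensor, where 0.0 marks a missing sensor reading.
predicted = [0.5, 1.0, 1.5, 2.0]
sensor = [1.0, 2.0, 0.0, 4.0]

scale = fit_depth_scale(predicted, sensor)
aligned = align_depth(predicted, scale)
```

In this toy case the sensor hole at the third pixel is filled by the rescaled prediction. For video, one option is to fit a single scale per clip rather than per frame so that the aligned depth does not flicker over time; whether RoboTransfer does this is not stated in the summary.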