EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer

Zhehao Dong, Xiaofeng Wang, Zheng Zhu, Yirui Wang, Yang Wang, Yukun Zhou, Boyuan Wang, Chaojun Ni, Runqi Ouyang, Wenkang Qin, Xinze Chen, Yun Ye, Guan Huang

TL;DR

EMMA addresses data scarcity for vision-language-action (VLA) policies in real-world robot manipulation by combining DreamTransfer, a diffusion Transformer-based framework that generates multi-view consistent, geometry-preserving, text-edited manipulation videos, with AdaMix, a hard-sample-aware training strategy. DreamTransfer supports sim-to-real and real-to-real transfer through depth-conditioned video generation with text control over foreground, background, and lighting. AdaMix dynamically reweights training data to emphasize challenging trajectories, improving policy robustness and generalization. In real-world tasks with zero-shot visual domains, training with DreamTransfer-generated data yields over a 200% relative performance gain compared to real-data-only training, and AdaMix adds a further 13%.

Abstract

Vision-language-action (VLA) models increasingly rely on diverse training data to achieve robust generalization. However, collecting large-scale real-world robot manipulation data across varied object appearances and environmental conditions remains prohibitively time-consuming and expensive. To overcome this bottleneck, we propose Embodied Manipulation Media Adaptation (EMMA), a VLA policy enhancement framework that integrates a generative data engine with an effective training pipeline. We introduce DreamTransfer, a diffusion Transformer-based framework for generating multi-view consistent, geometrically grounded embodied manipulation videos. DreamTransfer enables text-controlled visual editing of robot videos, transforming foreground, background, and lighting conditions without compromising 3D structure or geometrical plausibility. Furthermore, we explore hybrid training with real and generated data, and introduce AdaMix, a hard-sample-aware training strategy that dynamically reweights training batches to focus optimization on perceptually or kinematically challenging samples. Extensive experiments show that videos generated by DreamTransfer significantly outperform prior video generation methods in multi-view consistency, geometric fidelity, and text-conditioning accuracy. Crucially, VLAs trained with generated data enable robots to generalize to unseen object categories and novel visual domains using only demonstrations from a single appearance. In real-world robotic manipulation tasks with zero-shot visual domains, our approach achieves over a 200% relative performance gain compared to training on real data alone, and further improves by 13% with AdaMix, demonstrating its effectiveness in boosting policy generalization.
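
As a reading aid, the headline figure above can be unpacked under the usual definition of relative gain. The symbols s_real (success rate with real data only) and s_EMMA (success rate with generated data added) are notation introduced here, not the paper's, and the abstract does not say whether the further 13% from AdaMix is a relative or an absolute improvement.

```latex
\[
\text{relative gain}
  = \frac{s_{\text{EMMA}} - s_{\text{real}}}{s_{\text{real}}} > 200\%
\quad\Longleftrightarrow\quad
s_{\text{EMMA}} > 3\, s_{\text{real}}
\qquad (s_{\text{real}} > 0)
\]
```

Under this reading, a gain of over 200% means the policy trained with generated data succeeds more than three times as often as the real-data-only baseline.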

Paper Structure

This paper contains 26 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: DreamTransfer demonstrates strong controllability in embodied manipulation video generation. It excels in text-controlled appearance editing while preserving 3D structure and geometric plausibility, and supports both real-to-real and sim-to-real transfer. The complete prompts used for generation are provided in the supplementary materials.
  • Figure 2: Overview of the EMMA framework. First, DreamTransfer generates multi-view consistent videos by performing text-controlled visual editing of the foreground, background, and lighting conditions, conditioned on depth and corresponding text prompts. The generated videos are then evaluated by a video quality filter. Low-quality videos are initially assigned zero sampling weight to stabilize early-stage training. The AdaMix module further adaptively reweights training samples based on trajectory performance metrics, up-weighting challenging samples to improve policy robustness and generalization.
  • Figure 3: Overview of the DreamTransfer framework. Multi-view depth maps are concatenated along the width dimension. The main branch denoises latent video tokens, while a parallel ControlNet branch ensures geometric consistency by incorporating depth constraints.
  • Figure 4: Impact of data mix ratios on real-world robotic task performance.
  • Figure 5: Visual comparison with state-of-the-art robot manipulation video transfer models. DreamTransfer generates videos with superior multi-view consistency, more accurate 3D structure preservation, and higher geometric plausibility under text-controlled appearance editing.
  • ...and 1 more figure
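
The captions for Figures 2 and 3 describe two mechanisms that a small sketch may make more concrete: concatenating multi-view depth maps along the width dimension, and assigning sampling weights that zero out filtered videos while up-weighting challenging samples. The code below is only an illustration under stated assumptions, not the authors' implementation. The function names, the `quality_threshold` and `boost` parameters, and the use of a per-sample success rate as the "trajectory performance metric" are all hypothetical. Depth maps are plain nested lists to keep the example dependency-free.

```python
import random
from typing import List, Sequence

Frame = List[List[float]]  # one depth map stored as rows of pixel values (H x W)


def concat_views_along_width(views: Sequence[Frame]) -> Frame:
    """Join per-camera depth maps side by side: V maps of H x W -> H x (V*W)."""
    height = len(views[0])
    if any(len(view) != height for view in views):
        raise ValueError("all views must share the same height")
    return [[px for view in views for px in view[row]] for row in range(height)]


def sampling_weights(
    quality: Sequence[float],
    success_rate: Sequence[float],
    quality_threshold: float = 0.5,  # hypothetical filter cut-off
    boost: float = 1.0,              # hypothetical up-weighting strength
) -> List[float]:
    """Zero weight for filtered videos; larger weight for low-success samples."""
    raw = [
        0.0 if q < quality_threshold else 1.0 + boost * (1.0 - s)
        for q, s in zip(quality, success_rate)
    ]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("every sample was filtered out")
    return [w / total for w in raw]  # normalized to sum to 1


# Two 2x2 depth maps from hypothetical left and right cameras.
left = [[1.0, 2.0], [3.0, 4.0]]
right = [[5.0, 6.0], [7.0, 8.0]]
wide = concat_views_along_width([left, right])

# Sample 1 fails the quality filter; sample 2 is harder than sample 0.
weights = sampling_weights(quality=[0.9, 0.2, 0.8], success_rate=[1.0, 0.0, 0.0])

# Draw a training batch of sample indices according to the weights.
batch = random.choices(range(3), weights=weights, k=8)
```

In this toy setup the filtered sample is never drawn, and the sample with the lower success rate is drawn about twice as often as the easy one. The captions say low-quality videos receive zero weight only initially and that weights adapt during training, so a fuller version would recompute these weights as the policy's performance changes.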