IGen: Scalable Data Generation for Robot Learning from Open-World Images

Chenghao Gu, Haolan Kang, Junchao Lin, Jinghe Wang, Duo Wu, Shuzhao Xie, Fanding Huang, Junchen Ge, Ziyang Gong, Letian Li, Hongying Zheng, Changwei Lv, Zhi Wang

TL;DR

IGen addresses the data bottleneck in robotic policy learning by converting open-world images into grounded visuomotor data. It reconstructs 3D scenes from a single image, uses vision-language models for high-level planning, and generates executable SE(3) actions that are rendered into temporally coherent observations. The framework demonstrates strong visual fidelity, reliable action generation, and policy transfer to real-world tasks, even outperforming real-robot data in some settings. This enables annotation-free, scalable data generation for training generalist robot policies using open-world imagery.

Abstract

The rise of generalist robotic policies has created an exponential demand for large-scale training data. However, on-robot data collection is labor-intensive and often limited to specific environments. In contrast, open-world images capture a vast diversity of real-world scenes that naturally align with robotic manipulation tasks, offering a promising avenue for low-cost, large-scale robot data acquisition. Despite this potential, the lack of associated robot actions hinders the practical use of open-world images for robot learning, leaving this rich visual resource largely unexploited. To bridge this gap, we propose IGen, a framework that scalably generates realistic visual observations and executable actions from open-world images. IGen first converts unstructured 2D pixels into structured 3D scene representations suitable for scene understanding and manipulation. It then leverages the reasoning capabilities of vision-language models to transform scene-specific task instructions into high-level plans and generate low-level actions as SE(3) end-effector pose sequences. From these poses, it synthesizes dynamic scene evolution and renders temporally coherent visual observations. Experiments validate the high quality of visuomotor data generated by IGen, and show that policies trained solely on IGen-synthesized data achieve performance comparable to those trained on real-world data. This highlights the potential of IGen to support scalable data generation from open-world images for generalist robotic policy training.
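As a rough illustration of how an SE(3) end-effector pose sequence can drive scene evolution, the sketch below moves a grasped object's point cloud along with the gripper. It is a minimal example and not IGen's implementation: the rigid-attachment assumption, the function names (`make_pose`, `propagate_object`), and the toy poses and points are all hypothetical.

```python
import math

def make_pose(yaw, translation):
    """Build a 4x4 homogeneous SE(3) matrix: rotation about z by `yaw`, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_pose(T):
    """Invert a rigid transform: [R | t]^-1 = [R^T | -R^T t]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def transform_points(T, points):
    """Apply a 4x4 rigid transform to a list of (x, y, z) points."""
    return [tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3]
                  for i in range(3))
            for (x, y, z) in points]

def propagate_object(poses, grasp_index, object_points):
    """Return one object point cloud per pose.

    Assumption (hypothetical, for illustration): the object is static up to
    and including the grasp frame, then moves rigidly with the end effector.
    """
    T_grasp_inv = invert_pose(poses[grasp_index])
    frames = []
    for i, T in enumerate(poses):
        if i <= grasp_index:
            frames.append(list(object_points))
        else:
            # Motion of the gripper since the grasp, expressed in the world frame.
            relative = matmul(T, T_grasp_inv)
            frames.append(transform_points(relative, object_points))
    return frames

# Toy trajectory: approach, grasp at the origin, then lift and rotate 90 degrees.
poses = [
    make_pose(0.0, (0.0, 0.0, 0.2)),
    make_pose(0.0, (0.0, 0.0, 0.0)),
    make_pose(math.pi / 2, (0.1, 0.0, 0.1)),
]
cup = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0)]
frames = propagate_object(poses, grasp_index=1, object_points=cup)
for i, frame in enumerate(frames):
    print(i, frame)
```

Each entry of `frames` is the object's point cloud at one time step. In a full pipeline, these per-frame point clouds would then be rendered into images to form the visual observations paired with the pose sequence.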

Paper Structure

This paper contains 18 sections, 3 equations, 17 figures, and 1 table.

Figures (17)

  • Figure 1: We propose IGen, a data generation framework that converts open-world images into grounded visuomotor data, enabling scalable data synthesis for robot learning. From a single image, IGen generates large-scale realistic observations and reliable actions. The policies trained solely on IGen-generated data can effectively generalize to real-world scenes and successfully perform manipulation tasks.
  • Figure 2: Overview of IGen. Given an open-world image and a task description, IGen first reconstructs the environment and objects as point clouds using foundation vision models. After spatial keypoint extraction, a VLM maps the task description to high-level plans and low-level control commands. While the robot executes these commands in simulation, a virtual depth camera captures the point-cloud sequence of its motion. The resulting end-effector pose trajectory is used to synthesize dynamic point-cloud sequences, which are then rendered frame by frame into visual observations of the manipulation. The final output consists of the generated robot actions and the visual observations.
  • Figure 3: Qualitative comparison of robotic behavior generation using IGen. Given a single captured image and a natural-language manipulation instruction, TesserAct [zhen2025tesseract], Cosmos [agarwal2025cosmos], and our IGen generate behavior observations. IGen produces more instruction-consistent and physically coherent object motions, closely matching the intended tasks. The green box represents action observations that adhere to physical laws and follow the task instructions, and the checkmark indicates task completion.
  • Figure 4: Quantitative comparison of robotic behavior generated by IGen. Performance is assessed on DreamGen Bench [jang2025dreamgen] under two criteria: Instruction Following and Physics Alignment. Evaluations are conducted using GPT-4o [achiam2023gpt], Qwen-3-VL-Plus [bai2025qwen2], and GLM-4.5V [zeng2025glm] as video assessment models. Each method generates 40 videos along with the prompts, and the reported metric represents the proportion of videos receiving a score of 1 from the evaluator.
  • Figure 5: Evaluation of IGen’s computational efficiency. We compare the video generation time and GPU memory consumption of IGen and baselines under identical input images and task instructions. The average computation time refers to the time required to generate one robot behavior video.
  • ...and 12 more figures
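
The metric described in the Figure 4 caption, the proportion of generated videos that an evaluator scores as 1, can be computed as in the short sketch below. The function name `pass_proportion` and the score values are hypothetical and serve only to illustrate the arithmetic; they are not results from the paper.

```python
def pass_proportion(scores):
    """Fraction of videos that received a score of 1 from the evaluator."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s == 1) / len(scores)

# Hypothetical binary scores for 40 videos from one method and one evaluator.
example_scores = [1] * 30 + [0] * 10
rate = pass_proportion(example_scores)
print(f"{rate:.2f}")
```

With three evaluator models and two criteria, this computation would be repeated once per (method, evaluator, criterion) combination.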