InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy

Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, Yaping Li, Ping Wang, Junhao Cai, Jia Zeng, Hao Dong, Jiangmiao Pang

TL;DR

This work shows that large-scale, high-fidelity synthetic data can match the strongest real-robot pre-training data for Vision-Language-Action (VLA) models. The authors present InternData-A1, a dataset generated by a decoupled, autonomous data-synthesis pipeline and comprising 630k trajectories (7,433 hours) across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, with photorealistic rendering and broad skill composition. A $\pi_0$ model pre-trained solely on InternData-A1 performs on par with or better than the official $\pi_0$ across 49 simulation tasks and 9 real-world tasks (5 regular and 4 long-horizon dexterous), with direct sim-to-real transfer observed on several tasks and favorable comparisons to open-source datasets. The dataset is released and the generation pipeline is slated to be open-sourced, underscoring the potential of large-scale simulation to lower the barrier to scalable embodied AI data while highlighting remaining challenges in highly dexterous manipulation.

Abstract

Recent works explore how real and synthetic data contribute to the generalization of Vision-Language-Action (VLA) models. While current VLA models have shown the strong effectiveness of large-scale real-robot pre-training, synthetic data has not previously demonstrated comparable capability at scale. This paper provides the first evidence that synthetic data alone can match the performance of the strongest $\pi$-dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation. The resulting model also exhibits surprising zero-shot sim-to-real transfer on several challenging tasks. Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables long-horizon skill composition, flexible task assembly, and heterogeneous embodiments with minimal manual tuning. Using the same architecture as $\pi_0$, we pre-train a model entirely on InternData-A1 and find that it matches the official $\pi_0$ across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks. We release the dataset and will open-source the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.

Paper Structure

This paper contains 27 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: InternData-A1 pioneers a large-scale, high-fidelity synthetic dataset with physically faithful, photorealistic rendering, diverse object domains (rigid, articulated, fluid, and deformable), extensive multi-skill tasks, and broad cross-embodiment coverage.
  • Figure 2: Data Statistics. InternData-A1 provides 4 single- or dual-arm embodiments, 70 diverse tasks, 3185 rigid objects, 321 articulated objects, 20 garments, and 227 rooms. Together, these elements yield 630k episodes, 401.4M frames, and 7433.9 hours.
  • Figure 3: Data synthesis pipeline of InternData-A1. It consists of four stages: (1) environment construction with selected embodiments, scenes, and objects; (2) task composition using modular atomic skills invoked via simple configuration commands; (3) domain randomization over layouts, object poses, lighting, etc.; and (4) trajectory generation, where CuRobo [curobo_report23] interpolates dense joint actions, validates them through physics simulation, and renders only successful trajectories into the LeRobot format.
  • Figure 4: Real-world setup. We evaluate the model pre-trained on InternData-A1 against the $\pi$-dataset baseline on a suite of nine real-world tasks spanning three robots.
  • Figure 5: Comparison to $\pi$-dataset in real-world tasks. InternData-A1 achieves performance comparable to $\pi$-dataset across 9 real-world tasks, including 4 dexterous ones, demonstrating its strong pre-training capability.
  • ...and 2 more figures
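
The four-stage pipeline described in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name below (`build_environment`, `compose_task`, `randomize`, `interpolate`, `validate`, `generate`) is hypothetical, linear interpolation stands in for the CuRobo motion planner, a joint-limit check stands in for physics validation, and rendering and LeRobot export are omitted. The randomization ranges and the joint limit are arbitrary illustrative values. The sketch only shows the control flow: build a scene, compose skills, randomize, then keep trajectories that pass validation.

```python
import random
from dataclasses import dataclass


@dataclass
class Environment:
    embodiment: str
    scene: str
    objects: list


@dataclass
class Episode:
    env: Environment
    skills: list
    randomization: dict
    actions: list  # dense joint-space actions, one list of joint values per step


def build_environment(embodiment, scene, objects):
    """Stage 1: select an embodiment, a scene, and the objects placed in it."""
    return Environment(embodiment, scene, list(objects))


def compose_task(skill_names):
    """Stage 2: a task is an ordered list of atomic skills named in a config."""
    return list(skill_names)


def randomize(rng):
    """Stage 3: sample layout, object-pose, and lighting parameters (ranges are assumed)."""
    return {
        "object_yaw_deg": rng.uniform(-180.0, 180.0),
        "light_intensity": rng.uniform(0.5, 1.5),
        "layout_id": rng.randrange(10),
    }


def interpolate(waypoints, steps_per_segment):
    """Planner stand-in: linear interpolation between consecutive joint waypoints."""
    dense = []
    for start, end in zip(waypoints, waypoints[1:]):
        for i in range(steps_per_segment):
            t = i / steps_per_segment
            dense.append([a + t * (b - a) for a, b in zip(start, end)])
    dense.append(list(waypoints[-1]))
    return dense


def validate(actions, joint_limit):
    """Physics stand-in: succeed only if every joint value stays within the limit."""
    return all(abs(q) <= joint_limit for step in actions for q in step)


def generate(env, skills, waypoints, n_attempts, seed=0, joint_limit=3.0):
    """Stage 4: plan a dense trajectory per attempt and keep only validated ones."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_attempts):
        params = randomize(rng)
        # Shift the goal waypoint to mimic a randomized object pose (illustrative only).
        jitter = rng.uniform(-1.0, 1.0)
        goal_shifted = [list(w) for w in waypoints]
        goal_shifted[-1] = [q + jitter for q in goal_shifted[-1]]
        actions = interpolate(goal_shifted, steps_per_segment=10)
        if validate(actions, joint_limit):
            kept.append(Episode(env, skills, params, actions))
    return kept
```

A call such as `generate(build_environment("dual_arm", "kitchen", ["cup"]), compose_task(["pick", "place"]), [[0.0, 0.0], [2.5, 2.5]], n_attempts=20)` returns only the attempts whose trajectories pass the validation stand-in, mirroring the caption's point that failed trajectories are discarded before rendering. The embodiment, scene, and skill names in this call are placeholders, not identifiers from the dataset.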