InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, Yaping Li, Ping Wang, Junhao Cai, Jia Zeng, Hao Dong, Jiangmiao Pang
TL;DR
This work shows that large-scale, high-fidelity synthetic data can match the best real-robot pre-training for Vision-Language-Action models. The authors present InternData-A1, a dataset produced by a decoupled, autonomous data-synthesis pipeline, containing over 630k trajectories (7,433 hours) across 4 embodiments, 18 skills, 70 tasks, and 227 scenes with photorealistic rendering and broad skill composition. A $\pi_0$ model pre-trained solely on InternData-A1 performs on par with or better than the official $\pi_0$ across 49 simulation tasks and 9 real-world tasks, shows direct sim-to-real transfer on multiple tasks, and compares favorably to models trained on open-source datasets. The dataset is released and the generation pipeline is to be open-sourced, underscoring the potential of large-scale simulation to lower the barrier to scalable embodied AI data while highlighting remaining challenges in highly dexterous manipulation.
Abstract
Recent works explore how real and synthetic data contribute to the generalization of Vision-Language-Action (VLA) models. While current VLA models have demonstrated the effectiveness of large-scale real-robot pre-training, synthetic data has not previously demonstrated comparable capability at scale. This paper provides the first evidence that synthetic data alone can match the performance of the strongest $\pi$ dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation. The resulting model also exhibits surprising zero-shot sim-to-real transfer on several challenging tasks. Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables long-horizon skill composition, flexible task assembly, and heterogeneous embodiments with minimal manual tuning. Using the same architecture as $\pi_0$, we pre-train a model entirely on InternData-A1 and find that it matches the official $\pi_0$ across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks. We release the dataset and will open-source the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.
