All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents
Zhiqiang Wang, Hao Zheng, Yunshuang Nie, Wenjun Xu, Qingwei Wang, Hua Ye, Zhe Li, Kaidong Zhang, Xuewen Cheng, Wanxi Dong, Chang Cai, Liang Lin, Feng Zheng, Xiaodan Liang
TL;DR
The paper addresses the fragmentation and modality gaps in embodied AI datasets by introducing ARIO, a standardized, timestamp-based data framework that unifies real-world data, simulated data, and data converted from existing open-source datasets across diverse robot morphologies. Building on this standard, the authors assemble a large-scale ARIO dataset (~3 million episodes, 258 series, 321,064 tasks) from real-world collection, multiple simulators, and open-source conversions, encompassing five modalities (images, 3D, sound, text, tactile) and a scene-series-task-episode structure. The key contributions are the ARIO standard, a comprehensive multi-modal dataset, and extensive statistics showing broad coverage of scenes, skills, and robot configurations, which are intended to support cross-embodiment learning and sim-to-real research. The work lays groundwork for scalable, generalizable embodied AI and invites further exploration into large-scale model training, richer modalities, and deeper sim-to-real alignment.
Abstract
Embodied AI is transforming how AI systems interact with the physical world, yet existing datasets are inadequate for developing versatile, general-purpose agents. These limitations include a lack of standardized formats, insufficient data diversity, and inadequate data volume. To address these issues, we introduce ARIO (All Robots In One), a new data standard that enhances existing datasets by offering a unified data format, comprehensive sensory modalities, and a combination of real-world and simulated data. ARIO aims to improve the training of embodied AI agents, increasing their robustness and adaptability across various tasks and environments. Building upon the proposed new standard, we present a large-scale unified ARIO dataset, comprising approximately 3 million episodes collected from 258 series and 321,064 tasks. The ARIO standard and dataset represent a significant step towards bridging the gaps of existing data resources. By providing a cohesive framework for data collection and representation, ARIO paves the way for the development of more powerful and versatile embodied AI agents, capable of navigating and interacting with the physical world in increasingly complex and diverse ways. The project is available at https://imaei.github.io/project_pages/ario/.
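To make the idea of a unified, timestamp-based format more concrete, the following is a minimal Python sketch of how a scene-series-task-episode hierarchy with per-modality timestamped streams might be modeled in memory. It is not the ARIO specification: the class names (`Scene`, `Series`, `Task`, `Episode`), the modality labels, the nesting order, and the nearest-timestamp lookup are all assumptions made for illustration, and the actual on-disk layout defined by the standard may differ.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

# The five modalities named in the summary; the string labels are assumptions.
MODALITIES = ("image", "3d", "sound", "text", "tactile")


@dataclass
class Episode:
    """One recorded run of a task (hypothetical model, not the ARIO schema)."""
    episode_id: str
    # modality -> list of (timestamp_seconds, payload), kept sorted by timestamp
    streams: Dict[str, List[Tuple[float, Any]]] = field(default_factory=dict)

    def add(self, modality: str, timestamp: float, payload: Any) -> None:
        """Insert a sample so that the stream stays ordered by timestamp."""
        if modality not in MODALITIES:
            raise ValueError(f"unknown modality: {modality}")
        stream = self.streams.setdefault(modality, [])
        times = [t for t, _ in stream]
        stream.insert(bisect_left(times, timestamp), (timestamp, payload))

    def nearest(self, modality: str, timestamp: float) -> Optional[Tuple[float, Any]]:
        """Return the sample closest in time, or None if the stream is empty."""
        stream = self.streams.get(modality, [])
        if not stream:
            return None
        times = [t for t, _ in stream]
        i = bisect_left(times, timestamp)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = stream[max(i - 1, 0): i + 1]
        return min(candidates, key=lambda s: abs(s[0] - timestamp))


@dataclass
class Task:
    instruction: str
    episodes: List[Episode] = field(default_factory=list)


@dataclass
class Series:
    name: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Scene:
    name: str
    series: List[Series] = field(default_factory=list)


def count_episodes(scenes: List[Scene]) -> int:
    """Total number of episodes across the whole hierarchy."""
    return sum(len(t.episodes) for sc in scenes for se in sc.series for t in se.tasks)


# Usage: sensors sampled at different rates are aligned by timestamp, not by index.
episode = Episode("ep_0001")
episode.add("image", 0.10, "frame_1")
episode.add("image", 0.00, "frame_0")    # out-of-order arrival is re-sorted
episode.add("tactile", 0.12, [0.3, 0.1])
image_at_touch = episode.nearest("image", 0.12)

demo_scene = Scene("kitchen", [Series("demo_series", [Task("pick up the cup", [episode])])])
total_episodes = count_episodes([demo_scene])
```

Keying every sample by timestamp rather than by frame index is what lets streams with different sampling rates coexist in a single episode; a consumer picks a reference time and queries each modality for its nearest sample.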
