Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI

Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, Liang Lin

TL;DR

Embodied AI aims to align cyber space with the physical world using multi-modal foundation models and world models. This survey maps embodied robots, simulators, and four core tasks (perception, interaction, agents, sim-to-real), and introduces ARIO as a unified dataset standard to accelerate general-purpose embodied agents. It highlights the rising prominence of Vision-Language-Action models and world-model-driven planning for navigating real and simulated environments, while outlining challenges in data, long-horizon tasks, causal reasoning, and security. The work provides a structured framework and actionable directions to advance scalable, robust embodied systems across diverse domains.

Abstract

Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI) and serves as a foundation for various applications (e.g., intelligent mechatronics systems, smart manufacturing) that bridge cyberspace and the physical world. Recently, the emergence of Multi-modal Large Models (MLMs) and World Models (WMs) has attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities, making them a promising architecture for embodied agents. In this survey, we comprehensively explore the latest advancements in Embodied AI. Our analysis first reviews representative works on embodied robots and simulators to understand current research focuses and their limitations. Then, we analyze four main research targets: 1) embodied perception, 2) embodied interaction, 3) embodied agent, and 4) sim-to-real adaptation, covering state-of-the-art methods, essential paradigms, and comprehensive datasets. Additionally, we explore the complexities of MLMs in virtual and real embodied agents, highlighting their significance in facilitating interactions in digital and physical environments. Finally, we summarize the challenges and limitations of embodied AI and discuss potential future directions. We hope this survey will serve as a foundational reference for the research community. The associated project can be found at https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List.

Paper Structure

This paper contains 46 sections, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: The framework of the embodied agent based on MLMs and WMs, which incorporates the ABC model: AI brain, Body, and Cross-modal sensors. The embodied agent is equipped with an embodied world model as the A model, enabling it to understand the virtual-physical environment. Through the C model, it actively perceives multi-modal elements, enhancing its situational awareness. Meanwhile, the B model enables the agent to execute actions and interact with humans while utilizing tools effectively.
  • Figure 2: The Embodied Robots include Fixed-base Robots, Quadruped Robots, Humanoid Robots, Wheeled Robots, Tracked Robots, and Biomimetic Robots.
  • Figure 3: Examples of General Simulators. The MuJoCo figure is from wang2020learning.
  • Figure 4: Examples of Real-Scene Based Simulators.
  • Figure 5: The schematic diagram of active visual perception. Visual SLAM and 3D Scene Understanding provide the foundation for passive visual perception, while active exploration makes the passive perception system active. These elements work collaboratively to form the active visual perception system.
  • ...and 7 more figures