A Survey: Learning Embodied Intelligence from Physical Simulators and World Models

Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao, Qionghai Dai

TL;DR

The paper proposes a five-level IR-L0 to IR-L4 framework to evaluate humanoid robot autonomy and social cognition, and surveys the complementary roles of physical simulators and world models in embodied AI. It analyzes how simulators provide safe, controllable training environments while world models offer internal predictive capabilities for planning, reward inference, and long-horizon decision making. The review covers mobility, manipulation, and human-robot interaction, compares mainstream simulators and their physics/rendering capabilities, and surveys a wide range of world-model architectures (RSSM, JEPA, transformer/diffusion) and applications (autonomous driving, articulated robots). The work highlights trends toward diffusion-based world models, multi-modal conditioning, and occupancy-based world representations, arguing that the integration of external simulation with internal modeling is key to achieving robust sim-to-real transfer and progress toward IR-L4 autonomy. It also provides a repository for up-to-date literature and emphasizes open challenges in data efficiency, generalization, causality, and evaluation in embodied AI.

Abstract

The pursuit of artificial general intelligence (AGI) has placed embodied intelligence at the forefront of robotics research. Embodied intelligence focuses on agents capable of perceiving, reasoning, and acting within the physical world. Achieving robust embodied intelligence requires not only advanced perception and control, but also the ability to ground abstract cognition in real-world interactions. Two foundational technologies, physical simulators and world models, have emerged as critical enablers in this quest. Physical simulators provide controlled, high-fidelity environments for training and evaluating robotic agents, allowing safe and efficient development of complex behaviors. In contrast, world models empower robots with internal representations of their surroundings, enabling predictive planning and adaptive decision-making beyond direct sensory input. This survey systematically reviews recent advances in learning embodied AI through the integration of physical simulators and world models. We analyze their complementary roles in enhancing autonomy, adaptability, and generalization in intelligent robots, and discuss the interplay between external simulation and internal modeling in bridging the gap between simulated training and real-world deployment. By synthesizing current progress and identifying open challenges, this survey aims to provide a comprehensive perspective on the path toward more capable and generalizable embodied AI systems. We also maintain an active repository that contains up-to-date literature and open-source projects at https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey.

Paper Structure

This paper contains 71 sections, 25 figures, 7 tables.

Figures (25)

  • Figure 1: Physical simulators and world models play vital roles in embodied intelligence. A simulator provides an explicit model of the real world, offering a controlled environment where robots can train, test, and refine their behaviors. A world model offers an internal representation of the environment, enabling robots to autonomously simulate, predict, and plan actions within their cognitive framework.
  • Figure 2: Survey outline. We categorize intelligent robot development into five levels (IR-L0 to IR-L4) and review progress and key techniques in robotic mobility, manipulation, and interaction; the use of physical simulators for learning and control algorithm verification; and the design and use of world models as internal representations for learning, planning, and decision-making, emphasizing both explicit and implicit learning pathways.
  • Figure 3: Levels of Intelligent Robotics: From Basic Execution to Full Autonomy.
  • Figure 4: Timeline of advancements in the adaptation of humanoid robots to unstructured environments.
  • Figure 5: Timeline of advancements in highly dynamic movements of humanoid robots.
  • ...and 20 more figures