RealEngine: Simulating Autonomous Driving in Realistic Context

Junzhe Jiang, Nan Song, Jingyu Li, Xiatian Zhu, Li Zhang

TL;DR

RealEngine presents a driving simulation framework that unifies background scene reconstruction and foreground traffic-participant modeling to deliver photorealistic, multi-modal sensor rendering in a closed-loop setting. It enables flexible scene composition, multi-agent interaction, and safety-critical evaluations across non-reactive, safety-test, and multi-agent scenarios. The approach leverages StreetGaussians and GS-LiDAR for efficient background reconstruction, 3D meshes for foreground agents, diffusion-guided lighting, and differentiable relighting to bridge the gap between realism and controllability. Through comprehensive experiments on Navsim/nuPlan data, RealEngine demonstrates improved reconstruction fidelity, stable closed-loop trajectories, and meaningful PDMS-based assessments, offering a practical benchmark for real-world driving performance. This work has significant implications for robust evaluation and development of autonomous driving systems in realistic, diverse, and interactive contexts.

Abstract

Driving simulation plays a crucial role in developing reliable driving agents by providing controlled, evaluative environments. To enable meaningful assessments, a high-quality driving simulator must satisfy several key requirements: multi-modal sensing capabilities (e.g., camera and LiDAR) with realistic scene rendering to minimize observational discrepancies; closed-loop evaluation to support free-form trajectory behaviors; highly diverse traffic scenarios for thorough evaluation; multi-agent cooperation to capture interaction dynamics; and high computational efficiency to ensure affordability and scalability. However, existing simulators and benchmarks fail to comprehensively meet these fundamental criteria. To bridge this gap, this paper introduces RealEngine, a novel driving simulation framework that holistically integrates 3D scene reconstruction and novel view synthesis techniques to achieve realistic and flexible closed-loop simulation in the driving context. By leveraging real-world multi-modal sensor data, RealEngine reconstructs background scenes and foreground traffic participants separately, allowing for highly diverse and realistic traffic scenarios through flexible scene composition. This synergistic fusion of scene reconstruction and view synthesis enables photorealistic rendering across multiple sensor modalities, ensuring both perceptual fidelity and geometric accuracy. Building upon this environment, RealEngine supports three essential driving simulation categories: non-reactive simulation, safety testing, and multi-agent interaction, collectively forming a reliable and comprehensive benchmark for evaluating the real-world performance of driving agents.

Paper Structure

This paper contains 21 sections, 7 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: The working mechanism of RealEngine. It consists of three modules: a driving agent, a motion controller, and a multi-modal renderer. Given a traffic scene represented by multimodal data, including multi-view images and a LiDAR point cloud, the driving agent predicts a trajectory, according to which RealEngine updates the motion state of all traffic participants. At the next time step, the multimodal sensor data is re-rendered from the updated ego-motion state and passed to the driving agent for the next round of trajectory planning. We consider three driving situations: non-reactive, safety test, and multi-agent interaction.
  • Figure 2: Scene composition. We start by modeling the background scene from real sensor data and obtaining the meshes of traffic participants, either by extracting them from the reconstructed data or through manual creation. This yields a rich space for designing customizable traffic scenarios. To create a specific scenario, we select a background scene and a set of traffic participants, then integrate them according to each participant's spatial coordinates over time. This naturally enables the creation of highly diverse scenes to support extensive closed-loop simulation.
  • Figure 8: Distortion and color inconsistency in the nuPlan benchmark. (a) The camera images in nuPlan exhibit significant barrel distortion, which poses challenges for Gaussian splatting based reconstruction. To address this, we apply distortion correction to the images. (b) Additionally, different cameras have varying exposure levels. To prevent color ambiguity for the same Gaussian primitive across different cameras, we learn an exposure transformation for each camera separately.
  • Figure 9: Lane change reconstruction. We render the camera images by shifting the driving perspective 3 meters to the left and right. Because driving videos cover a limited range of viewpoints while the driving environment is vast, large translations away from the recorded driving perspective lead to a significant decline in reconstruction quality. To address this, we use DriveX [yang2024drivex], which leverages a video generative prior to optimize the scene Gaussian primitives, resulting in improved reconstruction quality even with large viewpoint translations.
  • Figure 10: LiDAR reconstruction. Although reprojection may lead to some loss of LiDAR information, its impact on the low-resolution histogram used by the agent is minimal. Meanwhile, GS-LiDAR [jiang2025gslidar] achieves high-quality reconstruction of the reprojected LiDAR data.
  • ...and 6 more figures
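
The closed-loop mechanism described in the Figure 1 caption (agent predicts a trajectory, a motion controller updates the state, a renderer refreshes the sensor data) can be illustrated with a minimal Python sketch. This is not RealEngine's implementation: `EgoState`, `render`, `constant_velocity_agent`, `step_controller`, and `run_closed_loop` are hypothetical names, the renderer is a stand-in that returns only the pose, and the controller simply jumps to the first predicted waypoint.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Trajectory = List[Tuple[float, float]]  # waypoints (dx, dy) in the ego frame


@dataclass
class EgoState:
    x: float
    y: float
    heading: float  # radians, world frame


def render(state: EgoState) -> Dict[str, float]:
    """Stand-in for a multi-modal renderer (hypothetical).

    A real renderer would return camera images and a LiDAR sweep for this
    pose; here we only expose the pose so the loop is runnable.
    """
    return {"x": state.x, "y": state.y, "heading": state.heading}


def constant_velocity_agent(observation: Dict[str, float]) -> Trajectory:
    """Toy driving agent (assumption): always plans 1 m steps straight ahead."""
    return [(float(k), 0.0) for k in range(1, 5)]


def step_controller(state: EgoState, trajectory: Trajectory) -> EgoState:
    """Toy motion controller (assumption): move to the first waypoint."""
    dx, dy = trajectory[0]
    cos_h, sin_h = math.cos(state.heading), math.sin(state.heading)
    # Rotate the ego-frame offset into the world frame, then translate.
    new_x = state.x + dx * cos_h - dy * sin_h
    new_y = state.y + dx * sin_h + dy * cos_h
    new_heading = state.heading + math.atan2(dy, dx)
    return EgoState(new_x, new_y, new_heading)


def run_closed_loop(
    agent: Callable[[Dict[str, float]], Trajectory],
    initial_state: EgoState,
    steps: int,
) -> List[EgoState]:
    """Alternate render -> plan -> control for a fixed number of steps."""
    states = [initial_state]
    state = initial_state
    for _ in range(steps):
        observation = render(state)       # refresh sensor data at current pose
        trajectory = agent(observation)   # agent plans from what it observes
        state = step_controller(state, trajectory)  # advance the ego state
        states.append(state)
    return states


history = run_closed_loop(constant_velocity_agent, EgoState(0.0, 0.0, 0.0), steps=5)
final_state = history[-1]
```

The point of the sketch is the feedback structure: each observation depends on the state produced by the agent's previous plan, which is what distinguishes closed-loop evaluation from replaying logged sensor data.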