ReSim: Reliable World Simulation for Autonomous Driving

Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, Li Chen

TL;DR

ReSim tackles the scarcity of hazardous-action data in real-world driving by enriching real driving logs with simulator-generated non-expert data and by training a controllable diffusion-based world model conditioned on history, commands, and future waypoints. A dynamics-consistency loss, unbalanced noise sampling, and a multi-stage training regime yield high-fidelity, action-following futures, while Video2Reward derives reward signals from simulated videos to guide planning. Empirical results show up to 44% higher visual fidelity and over 50% gains in action controllability, with NAVSIM planning and reward-guided policy selection improving by 2% and 25%, respectively. The work enables closed-loop visual simulation and reward-driven decision making, offering a practical path toward robust open-world policy evaluation, though inference efficiency and benchmarking under fully fair closed-loop settings remain current limitations.

Abstract

How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.
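The reward-guided policy selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes only the pipeline stated above: simulate a future video for each candidate action, estimate a reward from that video, and keep the highest-scoring candidate. The names `select_by_reward`, `toy_simulate`, and `toy_reward` are hypothetical stand-ins and are not the authors' API; the toy functions replace ReSim and Video2Reward with trivial placeholders so the example runs on its own.

```python
from typing import Any, Callable, List, Sequence, Tuple

Waypoint = Tuple[float, float]   # assumed (x, y) position in the ego frame, in metres
Trajectory = Sequence[Waypoint]


def select_by_reward(
    history: Any,
    candidates: Sequence[Trajectory],
    simulate: Callable[[Any, Trajectory], Any],
    video_to_reward: Callable[[Any], float],
) -> Tuple[int, List[float]]:
    """Simulate each candidate, score the simulated future, return the argmax.

    `simulate` stands in for the world model and `video_to_reward` for the
    reward module; both are placeholders supplied by the caller.
    """
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    rewards = [video_to_reward(simulate(history, traj)) for traj in candidates]
    best = max(range(len(rewards)), key=rewards.__getitem__)
    return best, rewards


def toy_simulate(history: Any, traj: Trajectory) -> List[Waypoint]:
    # Placeholder "video": one pseudo-frame per waypoint. A real world model
    # would return predicted image frames conditioned on history and action.
    return list(traj)


def toy_reward(video: List[Waypoint]) -> float:
    # Placeholder reward: 1.0 if the lateral offset stays within 1.5 m,
    # otherwise 0.0. The threshold is an arbitrary illustrative value.
    max_offset = max(abs(y) for _, y in video)
    return 1.0 if max_offset <= 1.5 else 0.0


candidates = [
    [(1.0, 0.0), (2.0, 0.2)],   # stays near the lane centre
    [(1.0, 1.0), (2.0, 3.0)],   # drifts far to the side
]
best, rewards = select_by_reward(None, candidates, toy_simulate, toy_reward)
print(best, rewards)
```

Passing the simulator and the reward estimator as callables keeps the selection loop independent of any particular model, which is the only structural point this sketch is meant to convey.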

Paper Structure

This paper contains 28 sections, 2 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Overview of ReSim. (a) Heterogeneous driving data includes (i,ii) experts' safe driving logs, and (iii) potentially dangerous (non-expert) driving behaviors from simulations. (b) Prior driving world models are trained solely on expert data, leading to consistently safe yet inaccurate imaginations; in ReSim, we leverage all sources of data to simulate reliable and realistic futures, and build a robust reward model that generalizes to open-world scenarios within the simulator. (c) The high-fidelity prediction, accurate action-following, and reward estimation abilities of ReSim facilitate driving applications related to both policy deployment and simulation.
  • Figure 2: Video2Reward model (V2R). Top: V2R is supervised by the infraction scores of both safe and hazardous data from simulation, deriving the reward from a driving video. Bottom: In real-world inference, the video that ReSim predicts in reaction to a proposed action is fed into V2R to estimate the action's reward.
  • Figure 3: Video prediction-based policy. ReSim conditions on the history context (left) to synthesize a plausible visual plan (middle), which is then translated into an ego trajectory via an IDM (right).
  • Figure 4: Human evaluation of non-expert action controllability. ReSim gets the most votes in both realism and trajectory following.
  • Figure 5: Qualitative comparisons of non-expert action controllability. ReSim reliably simulates hazardous outcomes from the non-expert action, while other methods either fail to follow the specified trajectory or compromise the scenario's consistency. $^\star$: without simulated data in training.
  • ...and 13 more figures