ReSim: Reliable World Simulation for Autonomous Driving
Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, Li Chen
TL;DR
ReSim tackles the scarcity of hazardous-action data in real-world driving by enriching real driving logs with simulator-generated non-expert data and by training a controllable diffusion-based world model conditioned on history, commands, and future waypoints. A dynamics-consistency loss, unbalanced noise sampling, and a multi-stage training regime yield high-fidelity, action-following futures, while Video2Reward estimates reward signals from the simulated videos to guide planning. Empirical results show up to 44% higher visual fidelity and over 50% gains in action controllability, with planning and reward-guided policy selection on NAVSIM improving by 2% and 25%, respectively. The work enables closed-loop visual simulation and reward-driven decision making, offering a practical path toward robust open-world policy evaluation, although inference efficiency and fully fair closed-loop benchmarking remain current limitations.
Abstract
How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.
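The reward-guided policy selection mentioned above can be pictured with a minimal sketch. It is not the authors' implementation: `simulate` and `video_to_reward` are hypothetical callables standing in for the world model and the Video2Reward module, and the selection rule (take the candidate with the highest estimated reward) is an assumption made for illustration.

```python
from typing import Callable, Sequence, TypeVar

Plan = TypeVar("Plan")    # a candidate action or trajectory (hypothetical type)
Video = TypeVar("Video")  # a simulated future rollout (hypothetical type)


def select_plan(
    candidates: Sequence[Plan],
    simulate: Callable[[Plan], Video],
    video_to_reward: Callable[[Video], float],
) -> Plan:
    """Return the candidate whose simulated future gets the highest reward.

    Both callables are placeholders: `simulate` plays the role of the world
    model and `video_to_reward` the role of a reward estimator.
    """
    if not candidates:
        raise ValueError("at least one candidate plan is required")
    # Roll out each candidate and score the resulting simulated future.
    rewards = [video_to_reward(simulate(plan)) for plan in candidates]
    # Pick the index of the highest reward (the first one wins on ties).
    best_index = max(range(len(candidates)), key=rewards.__getitem__)
    return candidates[best_index]
```

In this sketch every candidate is simulated once, so the cost grows linearly with the number of candidates, which is where the inference-efficiency limitation noted in the TL;DR would be felt.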
