Bench2Drive-R: Turning Real World Data into Reactive Closed-Loop Autonomous Driving Benchmark by Generative Model

Junqi You; Xiaosong Jia; Zhiyuan Zhang; Yutao Zhu; Junchi Yan

TL;DR

The paper tackles the lack of realistic, reactive closed-loop evaluation for end-to-end autonomous driving models. It introduces Bench2Drive-R, a simulation-oriented generative framework that decouples rendering from behavior: a reactive behavioral controller (nuPlan-based) rolls out agent behavior, while a diffusion-based renderer with ControlNet generates temporally consistent images autoregressively, conditioned on scene state. Temporal noise modulation, 3D spatial encodings, and retrieval-based scene-level control are added to improve fidelity and spatial-temporal coherence. Experiments show state-of-the-art generation fidelity on nuScenes, improved closed-loop perception metrics, and competitive closed-loop driving results when integrated with nuPlan. The authors plan to open-source the code.

Abstract

For end-to-end autonomous driving (E2E-AD), evaluation remains an open problem. Existing closed-loop evaluation protocols usually rely on simulators such as CARLA, which are less realistic, while NAVSIM uses real-world vision data but is limited to fixed planning trajectories over a short horizon and assumes that other agents are not reactive. We introduce Bench2Drive-R, a generative framework that enables reactive closed-loop evaluation. Unlike existing video generative models for AD, the proposed designs are tailored for interactive simulation, where sensor rendering and behavior rollout are decoupled by applying a separate behavioral controller to simulate the reactions of surrounding agents. As a result, the renderer can focus on image fidelity, control adherence, and spatial-temporal coherence. For temporal consistency, given the step-wise interaction nature of simulation, we design a noise-modulating temporal encoder with Gaussian blurring to support long-horizon autoregressive rollout of image sequences without degradation from distribution shift. For spatial consistency, we introduce a retrieval mechanism that takes the spatially nearest images as references to ensure scene-level rendering fidelity during generation. The spatial relations between target and reference are explicitly modeled with 3D relative position encodings, and potential over-reliance on reference images is mitigated with hierarchical sampling and classifier-free guidance. We compare the generation quality of Bench2Drive-R with existing generative models and achieve state-of-the-art performance. We further integrate Bench2Drive-R into nuPlan and evaluate its generation quality through closed-loop simulation results. We will open-source our code.
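To make the retrieval idea concrete, the following is a minimal sketch of selecting the spatially nearest logged frames as references for a target pose. It is not the authors' implementation: the abstract does not state the distance metric or the form of the encoding, so this sketch assumes plain Euclidean distance between 3D positions and uses raw position offsets as a stand-in for the 3D relative position encodings. The names `retrieve_nearest_references` and `relative_offsets` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]


def retrieve_nearest_references(
    target: Position, candidates: Sequence[Position], k: int = 2
) -> List[int]:
    """Return indices of the k candidate positions closest to the target.

    Assumption: Euclidean distance in 3D; the paper may use a different
    criterion (for example, one that also accounts for heading).
    """
    if k <= 0:
        return []
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: math.dist(target, candidates[i]),
    )
    return ranked[:k]


def relative_offsets(
    target: Position, candidates: Sequence[Position], indices: Sequence[int]
) -> List[Position]:
    """Offsets (reference minus target) for the selected references.

    Assumption: these raw offsets only illustrate the input that a
    relative position encoding would be computed from.
    """
    return [
        tuple(c - t for c, t in zip(candidates[i], target))
        for i in indices
    ]


# Hypothetical logged camera positions (meters) and a target position.
logged = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 4.0, 0.0)]
target = (9.0, 1.0, 0.0)

idx = retrieve_nearest_references(target, logged, k=2)
offsets = relative_offsets(target, logged, idx)
print(idx)      # [2, 3]
print(offsets)  # [(1.0, -1.0, 0.0), (1.0, 3.0, 0.0)]
```

In a full pipeline, the images at the returned indices would be passed to the renderer together with their relative positions. Hierarchical sampling and classifier-free guidance, which the abstract mentions for limiting over-reliance on references, are outside the scope of this sketch.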
