Bench2Drive-R: Turning Real World Data into Reactive Closed-Loop Autonomous Driving Benchmark by Generative Model

Junqi You, Xiaosong Jia, Zhiyuan Zhang, Yutao Zhu, Junchi Yan

TL;DR

The paper tackles the lack of realistic, reactive closed-loop evaluation for end-to-end autonomous driving models. It introduces Bench2Drive-R, a simulation-oriented generative framework that decouples rendering from behavior: a reactive, nuPlan-based behavioral controller simulates surrounding agents, while a diffusion-based renderer with ControlNet generates temporally consistent images autoregressively, conditioned on the scene state. The renderer adds temporal noise modulation, 3D spatial encodings, and retrieval-based scene-level control to achieve high fidelity and spatial-temporal coherence. Experiments report state-of-the-art generation fidelity on nuScenes, improved closed-loop perception metrics, and competitive closed-loop driving evaluation when integrated with nuPlan. The authors state that they will open-source the code.

Abstract

For end-to-end autonomous driving (E2E-AD), the evaluation system remains an open problem. Existing closed-loop evaluation protocols usually rely on simulators such as CARLA, which are less realistic, while NAVSIM uses real-world vision data yet is limited to fixed planning trajectories over a short horizon and assumes that other agents are not reactive. We introduce Bench2Drive-R, a generative framework that enables reactive closed-loop evaluation. Unlike existing video generative models for AD, the proposed designs are tailored for interactive simulation, where sensor rendering and behavior rollout are decoupled by applying a separate behavioral controller to simulate the reactions of surrounding agents. As a result, the renderer can focus on image fidelity, control adherence, and spatial-temporal coherence. For temporal consistency, given the step-wise interaction nature of simulation, we design a noise-modulating temporal encoder with Gaussian blurring to support long-horizon autoregressive rollout of image sequences without deteriorating distribution shifts. For spatial consistency, we introduce a retrieval mechanism that takes the spatially nearest images as references to ensure scene-level rendering fidelity during generation. The spatial relations between target and reference are explicitly modeled with 3D relative position encodings, and the potential over-reliance on reference images is mitigated with hierarchical sampling and classifier-free guidance. We compare the generation quality of Bench2Drive-R with existing generative models and achieve state-of-the-art performance. We further integrate Bench2Drive-R into nuPlan and evaluate the generative quality with closed-loop simulation results. We will open source our code.
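The retrieval step mentioned in the abstract (taking the spatially nearest images as references) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `retrieve_nearest_reference`, the use of 2D ego positions, and the plain Euclidean distance are all assumptions made for illustration.

```python
import numpy as np


def retrieve_nearest_reference(query_xy, logged_xy):
    """Return (index, distance) of the logged ego position closest to query_xy.

    Hypothetical helper: the 2D position representation and the Euclidean
    metric are assumptions, not details taken from the paper.
    """
    logged_xy = np.asarray(logged_xy, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)
    # Distance from the current (possibly off-log) ego position to every logged pose.
    dists = np.linalg.norm(logged_xy - query_xy, axis=1)
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])


# Three logged ego positions along a straight road (made-up values).
logged_positions = np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.0]])
# The simulated ego has drifted slightly off the logged trajectory.
idx, dist = retrieve_nearest_reference([6.0, 1.0], logged_positions)
```

In this toy setup, the image recorded at the returned index would serve as the reference for rendering the current frame, even though the simulated ego is no longer exactly on the logged path.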

Paper Structure

This paper contains 26 sections, 5 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Different Paradigms of Generative Models for Autonomous Driving: (a) Single-Frame Image Generation [swerdlow2024bevgen, yang2023bevcontrol, gao2024magicdrive], as relatively early works, do not account for temporal generation. (b) Controllable Video Generation [wen2024panaceaplus, ma2024delphi, wang2023drivedreamer] focuses on generating videos with controls for each frame, which is not suitable for interactive simulation. (c) Predictive Video Generation [gao2024vista, wang2023drivingfuturemultiviewvisual, hu2023gaia] emphasizes the annotation-free training ability, which lacks the ability to adhere to control. (d) Interactive Image Generation: The proposed framework leverages the power of generative models in an autoregressive manner, enabling high-frequency interactions with end-to-end driving models and generating temporally consistent images. The integration of a rule-based behavioral controller simulates the behavior of other driving agents to ensure a coherent world state and provides layout controls for the generative part of the framework.
  • Figure 2: Overall Framework: The proposed Bench2Drive-R is composed of two parts: a behavioral controller that executes ego actions and generates the behaviors of other driving agents, and a generative renderer that produces multi-view sensor images in an autoregressive manner. To improve fidelity, the generative renderer (1) uses the previous-frame image for temporal consistency; (2) retrieves the spatially nearest reference image pair as a background prior; (3) adheres to projected layout element controls for object-level consistency.
  • Figure 3: Structural Design of Generative Renderer. We design three additive modules to ensure controllable, consistent, and interactive image generation. a) The temporal consistency module incorporates previous-frame images, and the noise modulation module helps prevent distribution drift during autoregressive rollout. b) Projected Object-Level Control allows fine-grained control over the location and orientation of driving vehicles in the scenario. c) Retrieval Scene-Level Control ensures spatial consistency by extracting multi-level features from the nearest reference image pairs and injecting them into the ControlNet with an attention mechanism.
  • Figure 4: Dealing with Autoregressive Distribution Shift. a) During training, because the previous and current frames closely resemble each other, the model tends to over-rely on the previous frame. During inference, generation errors (artifacts) then accumulate until the rollout collapses. b) Adding Gaussian blurring and a random level of noise to previous images destroys the obvious pixel-level relations between the two frames. The noise level $n$ is fed as an input to tell the model how strongly the image was corrupted. As a result, the model can adapt to degraded previous images and learns to extract high-level prior information instead of copying pixels.
  • Figure 5: Designs for Retrieval-based Scene-Level Control. a) Reference images are utilized in ControlNet with an additional cross-attention module (Ref-Attn). b) Pixel-level 3D position encodings are calculated and fed into cross-attention to provide spatial relations between reference and current images. c) E2E-AD agents may not follow the logged trajectory during inference, leading to a train-val gap.
  • ...and 12 more figures
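
The previous-frame corruption described in the Figure 4 caption (Gaussian blurring plus a random level of noise, with the noise level $n$ passed to the model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the blur width `sigma`, the uniform sampling of $n$, and the additive mixing rule are all hypothetical choices.

```python
import numpy as np


def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian kernel truncated at roughly 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def gaussian_blur(image, sigma):
    """Separable Gaussian blur of an (H, W, C) float image with edge padding."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    # Convolve along the height axis, then along the width axis.
    blurred = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 0, padded)
    blurred = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 1, blurred)
    return blurred


def corrupt_previous_frame(image, rng, sigma=2.0, max_noise_level=1.0):
    """Blur the previous frame and add Gaussian noise at a random level n.

    Returns the corrupted frame and n, so that n can be given to the model
    as a conditioning input. The mixing rule below is an assumption.
    """
    noise_level = float(rng.uniform(0.0, max_noise_level))
    blurred = gaussian_blur(image, sigma)
    corrupted = blurred + noise_level * rng.standard_normal(image.shape)
    return corrupted, noise_level


rng = np.random.default_rng(0)
previous_frame = rng.random((8, 8, 3))  # stand-in for a real camera image
corrupted, n = corrupt_previous_frame(previous_frame, rng)
```

During training, a model conditioned on `corrupted` and `n` rather than on the clean previous frame cannot simply copy pixels, which is the motivation the caption gives for this design.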