RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation

Tianyi Yan, Wencheng Han, Xia Zhou, Xueyang Zhang, Kun Zhan, Cheng-zhong Xu, Jianbing Shen

TL;DR

RLGF tackles geometric distortions in diffusion-based autonomous driving video generation by injecting perception-based geometric rewards into a latent-space reinforcement learning framework. It introduces Latent-Space Windowing Optimization to provide targeted feedback during diffusion and a Hierarchical Geometric Reward that combines point-line-plane and scene-occupancy signals from latent perception models. The approach yields substantial improvements in 3D object detection accuracy and reduces geometric gaps relative to real data, while preserving visual realism. This plug-and-play method enables more geometrically faithful synthetic AD data for training and validating perception systems.

Abstract

Synthetic data is crucial for advancing autonomous driving (AD) systems, yet current state-of-the-art video generation models, despite their visual realism, suffer from subtle geometric distortions that limit their utility for downstream perception tasks. We identify and quantify this critical issue, demonstrating a significant performance gap in 3D object detection when using synthetic versus real data. To address this, we introduce Reinforcement Learning with Geometric Feedback (RLGF), which refines video diffusion models by incorporating rewards from specialized latent-space AD perception models. Its core components are an efficient Latent-Space Windowing Optimization technique for targeted feedback during diffusion and a Hierarchical Geometric Reward (HGR) system that provides multi-level rewards for point-line-plane alignment and scene occupancy coherence. To quantify these distortions, we propose GeoScores. Applied to models like DiVE on nuScenes, RLGF substantially reduces geometric errors (e.g., VP error by 21%, Depth error by 57%) and improves 3D object detection mAP by 12.7%, narrowing the gap to real-data performance. RLGF offers a plug-and-play solution for generating geometrically sound and reliable synthetic videos for AD development.
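To make the idea of a multi-level reward concrete, here is a minimal sketch of how a point-line-plane term and a scene-occupancy term might be combined into one scalar. It is an illustration under stated assumptions, not the paper's formulation: the abstract does not give the reward equations, so the error-to-reward mapping `1 / (1 + error)`, the weighted sum, and the names `RewardWeights`, `error_to_reward`, and `hierarchical_reward` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights: the source does not state how the two levels are balanced.
    geo: float = 1.0
    occ: float = 1.0


def error_to_reward(error: float) -> float:
    """Map a non-negative error to a reward in (0, 1]; lower error gives higher reward.

    This mapping is an assumption chosen for illustration only.
    """
    if error < 0:
        raise ValueError("error must be non-negative")
    return 1.0 / (1.0 + error)


def hierarchical_reward(geo_error: float, occ_error: float,
                        weights: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of a point-line-plane reward and a scene-occupancy reward."""
    r_geo = error_to_reward(geo_error)  # stands in for point-line-plane alignment
    r_occ = error_to_reward(occ_error)  # stands in for scene occupancy coherence
    return weights.geo * r_geo + weights.occ * r_occ
```

In a real pipeline the two errors would come from frozen perception models that compare a generated latent against a reference video. Here they are plain floats so the sketch stays self-contained.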

Paper Structure

This paper contains 29 sections, 13 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: (a) Original Video Generation Models, optimized via pixel-level supervision (e.g., noise prediction error), often produce visually plausible videos that nonetheless suffer from severe geometric flaws (misaligned planes/lines, wrong perspective). This can degrade downstream tasks like 3D object detection (e.g., mAP drop from 35.5 to 25.7). (b) Our RLGF integrates a Hierarchical Geometry Reward directly into the multi-step denoising process. This reward, derived from perception models, guides the generation model to produce outputs with aligned planes, correct lane structures, and accurate perspective. (c) Visualized depth maps from noisy latents at various denoising stages (from noisy to less noisy) show coarse geometry emerging early and details later. This motivates our Latent-Space Windowing Optimization for targeted intermediate rewards.
  • Figure 2: Overview of the Reinforcement Learning with Geometric Feedback (RLGF) framework. RLGF fine-tunes a frozen, well-trained diffusion model via LoRA using rewards from a "Latent Space Windowing" scheme. Within this window, intermediate latents $z_{t'-w}$ are evaluated by frozen perception models $\mathcal{P}_{geo}$ (point-line-plane alignment) and $\mathcal{P}_{occ}$ (scene-level consistency) against a reference video. The resulting rewards ($R_{geo},R_{occ}$) generate gradients (red arrows) that update the LoRA weights, improving geometric and temporal consistency. Black arrows: feed-forward pass; dashed red arrows: gradients.
  • Figure 3: Qualitative comparison of 3D object bounding box alignment. RLGF-enhanced video exhibits much-improved 3D box alignment, closely matching the geometry implied by the scene.
  • Figure 4: Left: Detection results on a real nuScenes image. Right: Detection results on a corresponding synthetic image generated by the DiVE baseline. Bounding boxes indicate detected objects (primarily vehicles).
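The windowing scheme in the Figure 2 caption, where a latent $z_{t'-w}$ is evaluated after a short window of denoising steps, can be sketched as simple index bookkeeping. This is only a sketch under assumptions: the caption does not say how the start step $t'$ is chosen, so uniform random sampling is assumed here. The convention that timesteps count down from most noisy to clean and the helper names `sample_window` and `steps_in_window` are likewise hypothetical.

```python
import random
from typing import List, Tuple


def sample_window(num_steps: int, window: int, rng: random.Random) -> Tuple[int, int]:
    """Pick a start step t' and return (t', t' - window).

    Assumed convention: steps count down from num_steps (most noisy) to 0
    (clean), so the window covers the `window` denoising steps that take the
    latent from step t' to step t' - window.
    """
    if not 0 < window <= num_steps:
        raise ValueError("window must satisfy 0 < window <= num_steps")
    # Assumption: t' is drawn uniformly; randint is inclusive on both ends.
    t_start = rng.randint(window, num_steps)
    return t_start, t_start - window


def steps_in_window(t_start: int, t_end: int) -> List[int]:
    """Timesteps at which the denoiser would be applied inside the window."""
    return list(range(t_start, t_end, -1))


if __name__ == "__main__":
    rng = random.Random(0)
    t_start, t_end = sample_window(num_steps=50, window=5, rng=rng)
    # The latent reached at t_end plays the role of z_{t'-w}, the point where
    # the frozen perception models would compute the rewards.
    print(t_start, t_end, steps_in_window(t_start, t_end))
```

Restricting the reward to one short window means the perception models run once per update instead of at every denoising step, which is one plausible reading of why the abstract calls the technique efficient.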