DualDiff+: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance

Zhao Yang, Zezhong Qian, Xiaofan Li, Weixiang Xu, Gongpeng Zhao, Ruohong Yu, Lingsi Zhu, Longjun Liu

TL;DR

DualDiff is a dual-branch diffusion framework for high-fidelity driving scene generation conditioned on rich 3D geometry and multimodal inputs. Its key components are Occupancy Ray-shape Sampling (ORS) for detailed foreground/background conditioning, a Foreground-Aware Mask (FGM) denoising loss, Semantic Fusion Attention (SFA) for adaptive cross-modal fusion, and Reward-Guided Diffusion (RGD) for coherent image-to-video generation with high-level semantic alignment via the reward $R_{\text{I3D}}$. The approach achieves state-of-the-art results on NuScenes and Waymo, including a $4.09\%$ FID reduction on NuScenes, a $32.5\%$ reduction in FVD for video, and gains in downstream BEV segmentation and 3D object detection (foreground mAP +$1.46\%$, road mIoU +$1.70\%$, vehicle mIoU +$4.50\%$). A data-centric closed-loop training strategy with corner-case sampling further improves downstream perception, showing practical benefit for autonomous-driving perception pipelines. Overall, DualDiff advances geometry-aware, multimodal conditional generation with improved temporal coherence and task relevance for synthetic driving data.

Abstract

Accurate and high-fidelity driving scene reconstruction demands the effective utilization of comprehensive scene information as conditional inputs. Existing methods predominantly rely on 3D bounding boxes and BEV road maps for foreground and background control, which fail to capture the full complexity of driving scenes and to adequately integrate multimodal information. In this work, we present DualDiff, a dual-branch conditional diffusion model designed to enhance driving scene generation across multiple views and video sequences. Specifically, we introduce Occupancy Ray-shape Sampling (ORS) as a conditional input, offering rich foreground and background semantics alongside 3D spatial geometry to precisely control the generation of both elements. To improve the synthesis of fine-grained foreground objects, particularly complex and distant ones, we propose a Foreground-Aware Mask (FGM) denoising loss function. Additionally, we develop the Semantic Fusion Attention (SFA) mechanism to dynamically prioritize relevant information and suppress noise, enabling more effective multimodal fusion. Finally, to ensure high-quality image-to-video generation, we introduce the Reward-Guided Diffusion (RGD) framework, which maintains global consistency and semantic coherence in generated videos. Extensive experiments demonstrate that DualDiff achieves state-of-the-art (SOTA) performance across multiple datasets. On the NuScenes dataset, DualDiff reduces the FID score by 4.09% compared to the best baseline. In downstream tasks such as BEV segmentation, our method improves vehicle mIoU by 4.50% and road mIoU by 1.70%, while in BEV 3D object detection the foreground mAP increases by 1.46%. Code will be made available at https://github.com/yangzhaojason/DualDiff.

Paper Structure

This paper contains 18 sections, 11 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Architecture overview of DualDiff for video generation. The model uses Occupancy Ray-shape Sampling (ORS) and Semantic Fusion Attention (SFA) for scene representation, which are fed into a dual-branch foreground-background architecture. The outputs are merged through residual connections in a U-Net. Video generation follows a two-stage optimization: Spatio-Temporal Attention (ST-Attn) and Temporal Attention (Temporal Attn) are trained in the first stage, while Reward-Guided Diffusion (RGD) and Low-Rank Adaptation (LoRA) fine-tune the attention in the second stage.
  • Figure 2: Illustration of our proposed Occupancy Ray-shape Sampling (ORS) method, which projects 3D occupancy grid maps onto the image plane via ray-based sampling: each pixel is associated with a 3D ray, and uniformly sampled points along that ray are queried to generate a dense 2D feature representation.
  • Figure 3: Illustration of our proposed Semantic Fusion Attention (SFA) mechanism, which integrates ORS features with multimodal information. The SFA operates sequentially, enhancing the feature representation by leveraging complementary data from the different modalities.
  • Figure 4: Overview of the Reward-Guided Diffusion framework. For video generation, we extend the panoramic image generation approach by incorporating ST-Attn and Temporal Attn to enhance temporal consistency. During fine-tuning, we reduce the number of trainable parameters by adding LoRA to the attention mechanism of the original network. During training, latent variables are iteratively refined through a denoising loop, starting from pure noise. Denoised frames and ground truth are processed by the I3D model to extract temporal features, which are used to compute the reward function $R_{\text{I3D}}$. Dense gradients are propagated to optimize the model.
  • Figure 5: A comprehensive data-centric closed-loop framework comprising four key components: a. Image and video generation; b. Downstream task training; c. Corner case mining; d. Hard example retrieval.
  • ...and 5 more figures
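
The ray-based sampling described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ray_sample_occupancy`, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the identity camera pose, the nearest-voxel lookup, and the `near`/`far` depth range are all assumptions made for illustration. It only shows the general idea of casting one ray per pixel, sampling points uniformly along it, and querying a 3D occupancy grid to build a dense per-pixel representation.

```python
import math


def ray_sample_occupancy(occ, grid_min, voxel_size, fx, fy, cx, cy,
                         height, width, near, far, n_samples, empty=0):
    """Return a height x width x n_samples nested list of occupancy labels.

    occ is indexed occ[ix][iy][iz]. The camera is assumed to sit at the
    origin of the grid frame and look along +z (identity pose); a real
    pipeline would apply the camera extrinsics here instead.
    """
    nx, ny, nz = len(occ), len(occ[0]), len(occ[0][0])

    # Uniformly spaced depths shared by every ray, inclusive of near and far.
    if n_samples == 1:
        depths = [near]
    else:
        step = (far - near) / (n_samples - 1)
        depths = [near + i * step for i in range(n_samples)]

    out = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel centre to a ray direction with unit z.
            dx = (u + 0.5 - cx) / fx
            dy = (v + 0.5 - cy) / fy
            labels = []
            for t in depths:
                point = (dx * t, dy * t, t)
                # Nearest-voxel lookup: map the 3D point to integer grid indices.
                ix, iy, iz = (math.floor((point[k] - grid_min[k]) / voxel_size)
                              for k in range(3))
                inside = 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz
                labels.append(occ[ix][iy][iz] if inside else empty)
            row.append(labels)
        out.append(row)
    return out
```

Each pixel ends up with a vector of `n_samples` labels ordered by depth, so the output has the layout of an image whose channel axis runs along the ray. Samples that leave the grid receive the `empty` label, which keeps the output dense even when a ray exits the occupancy volume.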