MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction

Yingshuang Zou, Yikang Ding, Chuanrui Zhang, Jiazhe Guo, Bohan Li, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Haoqian Wang

TL;DR

MuDG tackles the challenge of robust urban scene reconstruction and novel-view synthesis under large viewpoint changes by marrying a controllable Multi-modal Diffusion Model (MDM) with Gaussian Splatting (GS). It conditions dense RGB, depth, and semantic outputs on fused LiDAR-derived sparse inputs, enabling feed-forward NVS without per-scene optimization while providing rich supervision for GS training. The framework demonstrates state-of-the-art results on the Open Waymo Dataset and supports scene editing and consistent multi-modal outputs across views. This approach has practical impact for autonomous driving, synthetic data generation, and robust 3D perception under extreme camera motions.

Abstract

Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods degrade substantially when the viewpoint deviates significantly from the training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, a framework that integrates a multi-modal diffusion model with Gaussian Splatting (GS) for urban scene reconstruction. MuDG leverages aggregated LiDAR point clouds with RGB and geometric priors to condition a multi-modal video diffusion model, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization, and it provides comprehensive supervision signals for refining 3DGS representations so that rendering stays robust under extreme viewpoint changes. Experiments on the Open Waymo Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.
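To make the conditioning step more concrete, the following is a minimal sketch of how sparse color and depth conditions could be rendered by projecting an aggregated, colored LiDAR point cloud into a novel camera. It is not the authors' implementation: the function name `render_sparse_conditions`, the pinhole model, the OpenCV-style camera frame (+z forward), and the nearest-point-per-pixel rule are all assumptions made for illustration.

```python
import numpy as np


def render_sparse_conditions(points_world, colors, world_to_cam, K, height, width):
    """Project colored 3D points into a camera, keeping the nearest point per pixel.

    Hypothetical helper (not from the paper). Returns a sparse color image
    (H, W, 3) and a sparse depth map (H, W); pixels that no point lands on
    stay at zero.
    """
    # Move points into the camera frame (assumed convention: +z looks forward).
    ones = np.ones((len(points_world), 1))
    pts_cam = (world_to_cam @ np.concatenate([points_world, ones], axis=1).T).T[:, :3]

    # Discard points behind (or almost on) the image plane.
    in_front = pts_cam[:, 2] > 1e-3
    pts_cam, colors = pts_cam[in_front], colors[in_front]

    # Pinhole projection to integer pixel coordinates.
    uvw = (K @ pts_cam.T).T
    z = pts_cam[:, 2]
    u = np.floor(uvw[:, 0] / z).astype(int)
    v = np.floor(uvw[:, 1] / z).astype(int)

    # Keep only points that fall inside the image bounds.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    # Z-buffer: sort by pixel id, then by depth, and keep the first
    # (nearest) point of every pixel group.
    pix = v * width + u
    order = np.lexsort((z, pix))
    pix_sorted = pix[order]
    first = np.ones(len(pix_sorted), dtype=bool)
    first[1:] = pix_sorted[1:] != pix_sorted[:-1]
    keep = order[first]

    sparse_color = np.zeros((height, width, 3), dtype=np.float32)
    sparse_depth = np.zeros((height, width), dtype=np.float32)
    sparse_color[v[keep], u[keep]] = colors[keep]
    sparse_depth[v[keep], u[keep]] = z[keep]
    return sparse_color, sparse_depth
```

Under these assumptions, calling the function once per target pose yields a sequence of sparse image-depth pairs of the kind the diffusion model is described as being conditioned on; how the paper actually handles occlusion and dynamic objects is not specified here.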

Paper Structure

This paper contains 21 sections, 10 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Given sparse sequential image-depth inputs, our multi-modal diffusion model enables controllable novel view synthesis without per-scene optimization. Its outputs also serve as supervision signals to enhance the Gaussian Splatting model. Furthermore, its strong controllability allows for scene editing (e.g., object removal, background modification), providing valuable data for autonomous driving.
  • Figure 2: Framework of MuDG. (a) Training Phase of the Multi-modal Diffusion Model (MDM). Given a reference image and sparse conditions $\mathcal{P}_{\text{color}}$ and $\mathcal{P}_{\text{depth}}$, we fine-tune the Video Diffusion Model to generate a color image $\mathbf{I}_{i}$, a 3-channel depth map $\mathbf{D}_{i}^{\text{3ch}}$ and a colorized semantic map $\mathbf{S}_{i}^{\text{colorized}}$ using multi-task embeddings $\boldsymbol{s}$. (b) Pipeline of the Multi-modal 3DGS Scene. Using the dense output from the MDM module, we optimize a 3D Gaussian Splatting (3DGS) representation with better geometry-semantic consistency.
  • Figure 3: Illustration of inference phase. During inference, our multi-modal diffusion model takes a dense reference image and a sequence of condition images as inputs, generating the corresponding sequences of RGB, depth, and semantic maps.
  • Figure 4: Visualization of multi-modal results. Given the reference image and sparse conditions, we present the visualized multi-modal novel-view synthesis results (color, depth, and semantic map). The depth maps are visualized in the range $[0, 100]$ m. The controllable and photorealistic results highlight the robust multi-modal synthesis capabilities of our approach, even under extreme viewpoint variations.
  • Figure 5: Qualitative Comparison of Novel View Synthesis on the Open Waymo Dataset. We evaluate novel view synthesis using a 2-meter leftward camera offset. Ours-R refers to the outputs generated by our multi-modal diffusion model, which relies exclusively on sparse conditional inputs (color and depth). In contrast, Ours-S refers to the results of the 3D Gaussian Splatting (3DGS) model, which is trained with the outputs of the Ours-R module as supervision signals. Our approach achieves photorealistic consistency and demonstrates superior geometric stability under significant viewpoint changes, outperforming existing baseline methods.
  • ...and 2 more figures
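
The 2-meter leftward camera offset used in the Figure 5 comparison can be pictured as a simple pose shift. The sketch below is illustrative only: the helper name `shift_camera_left` is hypothetical, and it assumes a 4x4 camera-to-world matrix in an OpenCV-style camera frame where +x points to the right, so that "left" is the negative x axis of the camera. The paper's own pose convention may differ.

```python
import numpy as np


def shift_camera_left(cam_to_world, offset_m=2.0):
    """Return a copy of a 4x4 camera-to-world pose moved offset_m to the camera's left.

    Hypothetical helper; assumes the camera's +x axis points right.
    """
    shifted = cam_to_world.copy()
    # First rotation column = camera +x (right) axis expressed in world coordinates.
    right_axis_world = cam_to_world[:3, 0]
    # Moving left means translating against the right axis; orientation is unchanged.
    shifted[:3, 3] = cam_to_world[:3, 3] - offset_m * right_axis_world
    return shifted
```

Applying such a shift to every pose on the recorded trajectory gives a laterally displaced trajectory that keeps the original viewing direction, which is one way to set up an off-trajectory evaluation like the one described.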