ADGaussian: Generalizable Gaussian Splatting for Autonomous Driving with Multi-modal Inputs

Qi Song, Chenghong Li, Haotong Lin, Sida Peng, Rui Huang

TL;DR

ADGaussian addresses generalizable street scene reconstruction from a single posed image by fusing the color image with sparse LiDAR depth. It introduces multi-modal feature matching (a Siamese encoder followed by a cross-attention decoder), a depth-guided positional embedding, and multi-scale Gaussian decoding, so that appearance and geometry are optimized jointly rather than refined separately. The authors report state-of-the-art results on Waymo and KITTI, along with strong zero-shot robustness to novel-view shifting on unseen urban scenes.

Abstract

We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework over visual information and geometric cues. Furthermore, we propose a multi-modal feature matching strategy coupled with a multi-scale Gaussian decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on two large-scale autonomous driving datasets, Waymo and KITTI, demonstrate that ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization under novel-view shifting.
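
The abstract treats sparse LiDAR depth as a second input modality alongside the image. One common way to prepare such an input is to project the LiDAR points into the camera and rasterize them into a sparse depth map. The sketch below illustrates that step only. It is not taken from the paper: the function name `sparse_depth_map`, the pinhole model, the assumption that points are already expressed in the camera frame, and the choice of 0 as the "no return" value are all assumptions made for illustration.

```python
import numpy as np

def sparse_depth_map(points_cam, K, height, width):
    """Rasterize LiDAR points into a sparse depth image (hypothetical helper).

    points_cam: (N, 3) array of x, y, z in the camera frame, z pointing forward.
    K:          (3, 3) pinhole intrinsics matrix.
    Returns a (height, width) float32 array; 0 marks pixels without a return.
    """
    pts = np.asarray(points_cam, dtype=np.float64)
    pts = pts[pts[:, 2] > 0]               # drop points behind the camera

    uvw = pts @ K.T                        # homogeneous pixel coordinates
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts[:, 2]

    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # When several points land on one pixel, keep the nearest one.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0
    return depth.astype(np.float32)
```

The resulting map is mostly empty, since a LiDAR sweep hits only a small fraction of the pixels. That sparsity is the reason the abstract emphasizes matching depth features against image features instead of relying on depth alone.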

Paper Structure

This paper contains 27 sections, 5 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: We introduce ADGaussian, a generalizable Gaussian framework for street scene reconstruction. Our approach achieves superior performance in both visual and geometric reconstruction. The bottom row illustrates the results of viewpoint shifting, further demonstrating the robustness of our method under varying viewpoint changes.
  • Figure 2: Overview of ADGaussian. Given a posed monocular image with sparse depth as input, we first extract well-fused multi-modal features through Multi-modal Feature Matching, which consists of a Siamese-style encoder and a cross-attention decoder enhanced by Depth-guided Positional Embedding (DPE). Subsequently, the Gaussian Head and Geometry Head, augmented with Multi-scale Gaussian Decoding, predict the different Gaussian parameters.
  • Figure 3: Qualitative comparison with state-of-the-art methods on the Waymo dataset. Our ADGaussian surpasses all other competitive models in rendering quality within urban scenarios, thanks to the efficacy of our multi-modal matching-based architecture.
  • Figure 4: Qualitative comparison with state-of-the-art methods on the KITTI dataset. As highlighted by the red boxes, our ADGaussian demonstrates superior performance in preserving visual consistency, particularly in handling fine details such as thin poles and object edges.
  • Figure 5: Depth comparison with BPNet. Our method demonstrates superior depth estimation performance in certain challenging regions, even without depth pre-training.
  • ...and 1 more figures
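
The Figure 2 caption mentions a cross-attention decoder enhanced by a depth-guided positional embedding. The sketch below shows only the general shape of such a component: single-head scaled dot-product cross-attention in which image tokens attend to depth tokens, with a sinusoidal encoding of each depth token's metric depth added to the keys. The summary does not specify how DPE is computed, so the sinusoidal form, the `max_depth` value, the function names, and the decision to add the embedding to the keys only are assumptions made for illustration, not the authors' design.

```python
import numpy as np

def sinusoidal_depth_embedding(depth, dim, max_depth=80.0):
    """Encode per-token depth values as a (M, dim) sinusoidal vector.

    Hypothetical stand-in for a depth-guided positional embedding.
    dim must be even; max_depth (metres) is an assumed clipping range.
    """
    d = np.clip(np.asarray(depth, dtype=np.float64), 0.0, max_depth) / max_depth
    freqs = (2.0 ** np.arange(dim // 2)) * np.pi
    angles = d[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: (N, C), keys: (M, C), values: (M, C).
    Returns the attended features (N, C) and the weights (N, M).
    """
    scale = 1.0 / np.sqrt(queries.shape[-1])
    logits = (queries @ keys.T) * scale
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Usage: image tokens query depth tokens whose keys carry a depth encoding.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(6, 8))             # N=6 tokens, C=8 channels
depth_tokens = rng.normal(size=(4, 8))             # M=4 tokens, C=8 channels
token_depth = np.array([1.0, 5.0, 20.0, 100.0])    # metres, one per depth token

keys = depth_tokens + sinusoidal_depth_embedding(token_depth, dim=8)
fused, attn = cross_attention(image_tokens, keys, depth_tokens)
```

A real decoder would use learned projections, multiple heads, and residual connections. Those are omitted here to keep the attention pattern itself visible.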