GaussianAD: Gaussian-Centric End-to-End Autonomous Driving

Wenzhao Zheng, Junjie Wu, Yao Zheng, Sicheng Zuo, Zixun Xie, Longchao Yang, Yong Pan, Zhihui Hao, Peng Jia, Xianpeng Lang, Shanghang Zhang

TL;DR

Vision-based autonomous driving often trades off scene detail against computational efficiency when using dense BEV or sparse object representations. GaussianAD introduces a 3D Gaussian scene representation, with 4D sparse convolutions and Gaussian flow to model future scene evolution, enabling end-to-end perception, prediction, and planning from surround-view images. The method supports optional supervision for perception tasks and demonstrates strong end-to-end planning performance and 4D occupancy forecasting on nuScenes, with ablations highlighting the impact of supervision and pruning. This work offers a scalable, information-rich, sparse representation that reduces information loss in the perception-to-planning pipeline.

Abstract

Vision-based autonomous driving shows great potential due to its satisfactory performance and low costs. Most existing methods adopt dense representations (e.g., bird's eye view) or sparse representations (e.g., instance boxes) for decision-making, which suffer from the trade-off between comprehensiveness and efficiency. This paper explores a Gaussian-centric end-to-end autonomous driving (GaussianAD) framework and exploits 3D semantic Gaussians to extensively yet sparsely describe the scene. We initialize the scene with uniform 3D Gaussians and use surrounding-view images to progressively refine them to obtain the 3D Gaussian scene representation. We then use sparse convolutions to efficiently perform 3D perception (e.g., 3D detection, semantic map construction). We predict 3D flows for the Gaussians with dynamic semantics and plan the ego trajectory accordingly with an objective of future scene forecasting. Our GaussianAD can be trained in an end-to-end manner with optional perception labels when available. Extensive experiments on the widely used nuScenes dataset verify the effectiveness of our end-to-end GaussianAD on various tasks including motion planning, 3D occupancy prediction, and 4D occupancy forecasting. Code: https://github.com/wzzheng/GaussianAD.

Paper Structure

This paper contains 12 sections, 13 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison of different pipelines for autonomous driving. Conventional end-to-end autonomous driving methods usually obtain refined scene descriptions (e.g., 3D boxes, maps) as the interface for prediction and planning, which may omit critical information. In contrast, the proposed GaussianAD uses sparse yet comprehensive 3D Gaussians to pass information through the pipeline, efficiently preserving more detail. We can optionally impose dense or sparse supervision to guide the learning of scene representations. Our pipeline can adapt to data with different available annotations.
  • Figure 2: Overview of the proposed GaussianAD framework. We initialize the sequence of 3D scenes with uniform Gaussians and employ 4D sparse convolutions to enable interactions between Gaussians. We then extract multi-scale features from surrounding-view multi-frame image observations and use deformable cross-attention to incorporate them into the 3D Gaussians. Having obtained the temporal 3D Gaussians as the scene representation, we can optionally employ Gaussian-to-voxel splatting (GaussianFormer) for dense tasks (e.g., 3D semantic occupancy) or use sparse convolutions and max-pooling (VoxelNeXt) for sparse tasks (e.g., 3D object detection, HD map construction, motion prediction). We use a flow head to predict a 3D flow for each Gaussian and aggregate the flows for trajectory planning.
  • Figure 3: Illustration of the training of our GaussianAD. Our framework can accommodate training data with different annotations by optionally imposing the corresponding supervision on the scene representation. Because 3D Gaussians are explicit and structured, we use a global affine transformation to predict the future scene representations observed by the ego vehicle as it follows the planned trajectory. We can then use future perception labels, or future scene representations obtained from future observations, as the supervision. Both impose stronger constraints on the planned trajectory than a low-dimensional trajectory discrepancy loss.
  • Figure 4: Visualizations of the results of our GaussianAD. We include the 3D object detection and planning results in the 3D occupancy visualizations. We also provide map visualizations. (Better viewed on a monitor when zoomed in.)
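
The global affine transformation mentioned in the Figure 3 caption can be illustrated with a small sketch. It is not the authors' implementation: the function names, the yaw-only ego motion, and the choice to apply the per-Gaussian flow before the ego transform are all assumptions made for illustration. The only fact relied on is the standard identity that a Gaussian with mean μ and covariance Σ, mapped by x → Rx + t, has mean Rμ + t and covariance RΣRᵀ.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Transpose a matrix stored as nested lists."""
    return [list(row) for row in zip(*m)]


def yaw_rotation(yaw):
    """Rotation about the z-axis by `yaw` radians."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def current_to_future_ego(yaw, position):
    """Affine map (R, t) from the current ego frame to a future ego frame.

    Assumption: the planned ego pose is a yaw angle and a position, both
    expressed in the current ego frame. The inverse of that pose maps
    current-frame points into the future frame: x' = R_e^T (x - p).
    """
    rot = transpose(yaw_rotation(yaw))
    trans = [-sum(rot[i][k] * position[k] for k in range(3)) for i in range(3)]
    return rot, trans


def move_gaussian(mean, cov, flow, rot, trans):
    """Shift one Gaussian by its flow, then apply x -> R x + t.

    The mean becomes R (mean + flow) + t and the covariance becomes
    R cov R^T; the translation does not affect the covariance.
    """
    moved = [m + f for m, f in zip(mean, flow)]
    new_mean = [sum(rot[i][k] * moved[k] for k in range(3)) + trans[i]
                for i in range(3)]
    new_cov = matmul(matmul(rot, cov), transpose(rot))
    return new_mean, new_cov


# Hypothetical example: a Gaussian 10 m ahead drifting 1 m forward, while
# the ego vehicle moves 2 m forward and turns 90 degrees to the left.
rot, trans = current_to_future_ego(math.pi / 2, [2.0, 0.0, 0.0])
mean, cov = move_gaussian(
    mean=[10.0, 0.0, 0.0],
    cov=[[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    flow=[1.0, 0.0, 0.0],
    rot=rot,
    trans=trans,
)
```

In this example the Gaussian ends up 9 m to the right of the future ego vehicle, and its long axis, originally along the direction of travel, now lies along the lateral axis. Because every Gaussian is updated in closed form, the predicted future scene can be compared directly against future labels or future observations, which is the stronger supervision signal the caption describes.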