Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting

Yiren Lu, Xin Ye, Burhaneddin Yaman, Jingru Luo, Zhexiao Xiong, Liu Ren, Yu Yin

Abstract

Bird's-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surround-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, in which image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box that often lacks explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we argue that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.

Paper Structure

This paper contains 19 sections, 13 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: An overview of the proposed Splat2BEV framework. Compared to traditional implicit end-to-end approaches, Splat2BEV first reconstructs the scene using 3D Gaussian Splatting, and then projects the reconstructed scene into the Bird’s-Eye-View to obtain geometry-aligned representations for downstream tasks, leading to substantial performance improvements.
  • Figure 2: An overview of our training process. In the first stage, given multi-view perspective images as input, Splat2BEV trains a feed-forward Gaussian generator to reconstruct the 3D scene using 3D Gaussian Splatting. In the second stage, the Gaussian generator is frozen, and the reconstructed geometry, along with its associated features, is projected onto the BEV plane. A BEV encoder and segmentation head are then trained on top of this BEV representation to perform downstream tasks. Finally, in the third stage, the Gaussian generator, BEV encoder, and segmentation head are jointly fine-tuned, allowing geometry, semantics, and task-specific cues to be harmonized for optimal BEV perception.
  • Figure 3: Our Gaussian generator consists of two branches: a multi-view branch based on UniMatch (Xu et al., 2023), and a per-view branch built upon ViT-S (Dosovitskiy et al., 2020) to augment the multi-view representation. The multi-view branch outputs multi-view features and cost volumes, which are concatenated with monocular features from the per-view branch and fed to a U-Net (Ronneberger et al., 2015) that predicts depth maps and per-Gaussian parameters. The predicted depths are then unprojected and combined with the Gaussian parameters to form 3D Gaussians.
  • Figure 4: Visualization of reconstruction quality. The left side shows the 3D reconstruction, its feature field, and the corresponding BEV map and projected BEV feature. The right side provides zoomed-in views that highlight fine-grained details, such as the zebra crossing in View 1, the dustbin in View 2, the person in View 3, and the no-parking line in View 4.
  • Figure 5: Visual comparison of features learned with and without explicit reconstruction. The BEV feature refers to the feature map produced by the BEV encoder, while the projected feature denotes the feature directly projected from the 3D representation.
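
The geometric steps named in the captions of Figures 2 and 3, unprojecting predicted depths into 3D and projecting the result onto the BEV plane, can be illustrated with a minimal sketch. It is not the Splat2BEV implementation. It assumes a pinhole camera, treats each Gaussian as a point at its center (ignoring covariance and opacity), and averages the features that fall in each BEV cell instead of rendering them by splatting. The function names `unproject_depth` and `splat_to_bev`, the intrinsics `fx, fy, cx, cy`, and the grid parameters are all hypothetical.

```python
from math import floor


def unproject_depth(depth, fx, fy, cx, cy, cam_to_ego):
    """Lift a depth map (list of rows) to 3D points in the ego frame.

    Assumes a pinhole camera; cam_to_ego is a 4x4 rigid transform
    given as row-major nested lists.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            # Pinhole back-projection of pixel (u, v) into the camera frame.
            xc, yc, zc = (u - cx) / fx * d, (v - cy) / fy * d, d
            # Rotate and translate into the ego frame.
            points.append(tuple(
                r[0] * xc + r[1] * yc + r[2] * zc + r[3]
                for r in cam_to_ego[:3]
            ))
    return points


def splat_to_bev(points, feats, x_range, y_range, cell):
    """Average per-point features into an (nx, ny, dim) BEV grid.

    Simplification: each point contributes only to the cell containing
    its (x, y) position; height is discarded and empty cells stay zero.
    """
    nx = round((x_range[1] - x_range[0]) / cell)
    ny = round((y_range[1] - y_range[0]) / cell)
    dim = len(feats[0])
    sums = [[[0.0] * dim for _ in range(ny)] for _ in range(nx)]
    counts = [[0] * ny for _ in range(nx)]
    for (x, y, _z), f in zip(points, feats):
        ix = floor((x - x_range[0]) / cell)
        iy = floor((y - y_range[0]) / cell)
        if 0 <= ix < nx and 0 <= iy < ny:  # drop points outside the grid
            counts[ix][iy] += 1
            for c in range(dim):
                sums[ix][iy][c] += f[c]
    return [
        [[s / max(counts[i][j], 1) for s in sums[i][j]] for j in range(ny)]
        for i in range(nx)
    ]
```

In this simplified form, points from all cameras would be concatenated before calling `splat_to_bev`, so that every view contributes to one shared BEV grid.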