GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation

Florian Chabot, Nicolas Granger, Guillaume Lapouge

TL;DR

GaussianBeV addresses BeV semantic segmentation by replacing traditional view transformers with a 3D Gaussian scene representation that is predicted from multi-view images without per-scene optimization. It predicts one Gaussian per image pixel (center, scale, rotation, opacity, embedding), expressed in the world frame, then renders a BeV feature map through a differentiable Gaussian splatting rasterizer with orthographic projection. The model is trained end-to-end with semantic, depth, and early supervision losses, achieving state-of-the-art results on nuScenes with competitive inference speed. This approach enables finer 3D scene modeling directly in BeV for robust, real-time perception in autonomous driving contexts.

Abstract

The Bird's-eye View (BeV) representation is widely used for 3D perception from multi-view camera images. It allows features from different cameras to be merged into a common space, providing a unified representation of the 3D scene. The key component is the view transformer, which transforms image views into the BeV. However, current view transformer methods based on geometry or cross-attention do not provide a sufficiently detailed representation of the scene, as they use a sub-sampling of the 3D space that is non-optimal for modeling the fine structures of the environment. In this paper, we propose GaussianBeV, a novel method for transforming image features to BeV by finely representing the scene using a set of 3D gaussians located and oriented in 3D space. This representation is then splatted to produce the BeV feature map by adapting recent advances in 3D representation rendering based on gaussian splatting. GaussianBeV is the first approach to use this 3D gaussian modeling and 3D scene rendering process online, i.e. without optimizing it on a specific scene and directly integrated into a single-stage model for BeV scene understanding. Experiments show that the proposed representation is highly effective and place GaussianBeV as the new state of the art on the BeV semantic segmentation task on the nuScenes dataset.
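
To make the rendering step concrete, here is a minimal NumPy sketch of splatting 3D gaussians onto a BeV grid with a top-down orthographic projection. It is an illustration under stated assumptions, not the authors' rasterizer: the function name `splat_to_bev`, the grid parameters `bev_range` and `bev_size`, and the purely additive accumulation are all hypothetical choices made for readability. Under an orthographic projection along the vertical axis, the projected 2D gaussian is simply the x-y marginal of the 3D one, which is what the sketch relies on.

```python
import numpy as np

def splat_to_bev(centers, covs, opacities, feats, bev_range=50.0, bev_size=200):
    """Additively splat 3D gaussians onto a top-down BeV grid (illustrative only).

    centers: (N, 3) world-space means, covs: (N, 3, 3) covariances,
    opacities: (N,), feats: (N, C). Returns a (bev_size, bev_size, C) map.
    bev_range and bev_size are assumed grid settings, not values from the paper.
    """
    # Cell centers of a square grid covering [-bev_range, bev_range] in x and y.
    step = 2.0 * bev_range / bev_size
    axis = -bev_range + step * (np.arange(bev_size) + 0.5)
    gx, gy = np.meshgrid(axis, axis, indexing="xy")
    grid = np.stack([gx, gy], axis=-1)  # (H, W, 2), rows index y, columns index x

    bev = np.zeros((bev_size, bev_size, feats.shape[1]))
    for mu, cov, alpha, f in zip(centers, covs, opacities, feats):
        # Orthographic top-down projection: drop z and keep the x-y covariance block.
        mu2, cov2 = mu[:2], cov[:2, :2]
        d = grid - mu2
        # Squared Mahalanobis distance of every cell center to the 2D gaussian.
        m = np.einsum("hwi,ij,hwj->hw", d, np.linalg.inv(cov2), d)
        weight = alpha * np.exp(-0.5 * m)
        # Assumption: plain weighted sum, no depth ordering or alpha compositing.
        bev += weight[..., None] * f
    return bev
```

A real rasterizer would restrict each gaussian to the cells it overlaps and run on GPU; the dense loop above only shows the geometry of the projection.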

Paper Structure (11 sections, 9 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Illustration of multiple BeV representations for BeV semantic segmentation. A camera is represented by the triangle at the bottom of each BeV. Features are represented by colors where blue, red and green represent the streetlight, car and lane marking respectively. (a) Depth-based methods place image features along the optical ray on the surface of objects. (b) In projection-based methods, 3D points on the optical ray receive the same feature. (c) Attention-based methods use downsampled dense spatial queries to keep memory costs down. (d) In GaussianBeV, the scene is represented by a set of rotated gaussians that finely describes the semantic structures in the scene.
  • Figure 2: Overview of GaussianBeV. The network takes as input a set of multiview images and extracts features for each of them. The 3D gaussian generator module (Sec. \ref{sec:gaussiangenerator}) predicts a 3D gaussian representation $G$ of the scene, which is then sent to the BeV rasterizer module (Sec. \ref{sec:bevrasterizer}) to perform BeV rendering. The resulting BeV feature map $B$ is passed through a BeV backbone and segmentation heads to obtain the segmentation prediction. $G$ and $B$ are represented with colors for visualization purposes only.
  • Figure 3: 3D Gaussian generator. This module takes as input the feature map extracted from each camera, as well as the set of intrinsic $\left\{K_n\right\}$ and extrinsic $\left\{R_n| t_n\right\}$ parameters. For each pixel, it calculates the corresponding 3D gaussian by passing through prediction heads (green boxes). Some of these predictions are then decoded (blue and red boxes) and transformed to be expressed in the world reference frame (yellow boxes). All predicted gaussian parameters are then concatenated to produce the 3D gaussian representation $G$ of the scene.
  • Figure 4: Visualization of predicted vehicle segmentation in the first three rows and drivable area (blue) / lane boundary (orange) segmentation in the last three rows, on the nuScenes validation set. PCA is used to visualize the BeV feature map.
  • Figure 5: Influence of the early supervision. (a) Vehicle segmentation ground truth. (b) and (c) visualization of the BeV feature map for GaussianBeV trained without and with early supervision, respectively. Early supervision helps to achieve a representation closer to the content of the 3D scene. We observe that gaussians corresponding to the vehicles follow their shapes.
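
The world-frame transformation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the paper's decoding scheme: it assumes that each pixel comes with a depth along the optical axis, that rotation is parameterized as a unit quaternion in the camera frame, and that the extrinsics $\left\{R_n| t_n\right\}$ map camera coordinates to world coordinates ($X_w = R_n X_c + t_n$). The function names are hypothetical.

```python
import numpy as np

def pixel_to_world(u, v, depth, K, R, t):
    """Back-project pixel (u, v) to a world-space point.

    Assumptions: depth is measured along the optical axis, and (R, t)
    maps camera coordinates to world coordinates (X_w = R @ X_c + t).
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # camera-frame ray with z = 1
    return R @ (depth * ray) + t

def quat_to_rot(q):
    """Rotation matrix from a quaternion given as (w, x, y, z)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def world_covariance(scale, q_cam, R):
    """World-frame covariance Sigma = M S S^T M^T of one gaussian.

    scale: (3,) per-axis standard deviations; q_cam: camera-frame rotation
    (assumed parameterization); R: camera-to-world rotation.
    """
    M = R @ quat_to_rot(q_cam)  # compose camera-frame rotation with the extrinsic
    S = np.diag(scale)
    return M @ S @ S.T @ M.T
```

Factoring the covariance as a rotation times a diagonal scale keeps it symmetric positive semi-definite by construction, which is the usual reason gaussian splatting methods predict scale and rotation rather than the covariance entries directly.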