Toward Real-world BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting

Shu-Wei Lu, Yi-Hsuan Tsai, Yi-Ting Chen

TL;DR

This work tackles BEV perception under depth uncertainty by revisiting 2D unprojection and introducing a probabilistic depth model. GaussianLSS computes per-pixel depth distributions, converts them into 3D Gaussians, and uses Gaussian Splatting with multi-scale BEV rendering to produce uncertainty-aware BEV features. On nuScenes, the method achieves state-of-the-art results among unprojection-based approaches and is competitive with projection-based methods (a 0.4% IoU gap), while running ≈2.5x faster and reducing memory use (≈0.3x). The approach is particularly robust for long-range objects and attenuates uncertain regions through learned opacity, which suggests practical potential for real-world autonomous driving systems.

Abstract

Bird's-eye view (BEV) perception has gained significant attention because it provides a unified representation for fusing multi-view images and enables a wide range of downstream autonomous driving tasks, such as forecasting and planning. Recent state-of-the-art models use projection-based methods, which formulate BEV perception as query learning to bypass explicit depth estimation. Although this paradigm has shown promising advances, these methods still fall short of real-world applications because they lack uncertainty modeling and are computationally expensive. In this work, we introduce GaussianLSS, a novel uncertainty-aware BEV perception framework that revisits unprojection-based methods, specifically the Lift-Splat-Shoot (LSS) paradigm, and enhances them with depth uncertainty modeling. GaussianLSS represents spatial dispersion by learning a soft depth mean and computing the variance of the depth distribution, which implicitly captures object extents. We then transform the depth distribution into 3D Gaussians and rasterize them to construct uncertainty-aware BEV features. We evaluate GaussianLSS on the nuScenes dataset, where it achieves state-of-the-art performance among unprojection-based methods. Compared to projection-based methods, it runs 2.5x faster and uses 0.3x less memory, while remaining competitive with only a 0.4% IoU difference.

Paper Structure

This paper contains 29 sections, 24 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Comparisons of 2D unprojection-based and 3D projection-based methods on vehicle BEV segmentation. GaussianLSS achieves state-of-the-art performance among 2D unprojection baselines. In addition, it also demonstrates competitive performance compared to 3D projection-based methods, while offering significant advantages in memory efficiency and inference speed.
  • Figure 2: Comparison between the lifting method of Lift-Splat-Shoot (LSS) (Philion and Fidler, 2020) and our proposed GaussianLSS. LSS uses discrete depth probabilities for soft depth weighting but struggles with depth ambiguity due to the inherently ill-posed nature of depth estimation. GaussianLSS addresses depth ambiguity by modeling depth uncertainty. We calculate the depth mean ($\mu$) and uncertainty ($\sigma$) of the predicted depth distribution, converting the original soft weighting to an uncertainty-aware range $[\mu-k\sigma,\mu+k\sigma]$. The parameter $k$ acts as an error tolerance coefficient to control the extent of the spread centered at the mean depth.
  • Figure 3: Overview of GaussianLSS. Multi-view images are first processed by a backbone network to extract features. They are then input to a simple CNN layer to obtain splat features $F_i$, opacity $\alpha_i$, and depth distribution $P_i$. The predicted depth distribution undergoes an uncertainty transformation to produce a 3D uncertainty $x_i$. Next, BEV features are obtained through a splatting process, integrating features across views. The resulting BEV features $\mathbf{F}_{\text{BEV}}$, enriched with uncertainty awareness, are used as input to the task-specific head for prediction.
  • Figure 4: Sweeping analysis on error tolerance $k$. We vary the error tolerance coefficient $k$ across a range of values ($k \in [0.25, 2.0]$). The results indicate that performance remains consistent for $k$ values between 0.5 and 1.25. However, when $k$ becomes too large, the IoU drops significantly as the model tolerates excessive ambiguity, causing the features to spread out too much and lose precision. The red dot represents the baseline approach of directly predicting the extent of the 3D mean.
  • Figure 5: Qualitative results demonstrating the effectiveness of semantic learning by filtering opacity values below 0.01. The yellow regions represent areas masked out during feature lifting. The left column shows the six camera views surrounding the ego-vehicle, with the top three views being front-facing and the bottom three being back-facing. The right column depicts BEV predictions overlaid on the ground truth segmentation for reference. The results demonstrate the model's ability to learn meaningful semantic features and accurately project relevant regions to the BEV plane. The ego-vehicle is centered in the map, with visualization highlights focusing on critical areas.
  • ...and 3 more figures
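
The uncertainty transformation described in the Figure 2 caption can be illustrated with a minimal sketch. It computes the mean $\mu$ and standard deviation $\sigma$ of a discrete per-pixel depth distribution and then forms the range $[\mu-k\sigma,\mu+k\sigma]$. The function names, the three depth bins, and the probabilities are hypothetical values chosen for illustration, and the standard weighted mean and variance are assumed here; this is not the authors' implementation.

```python
import math


def depth_mean_and_std(probs, depth_bins):
    """Mean and standard deviation of a discrete depth distribution.

    probs      : non-negative weights, one per depth bin (normalized here)
    depth_bins : depth value in meters at the center of each bin
    Both names are illustrative assumptions, not taken from the paper.
    """
    if len(probs) != len(depth_bins):
        raise ValueError("probs and depth_bins must have the same length")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    weights = [p / total for p in probs]
    # Soft depth mean: expectation of depth under the predicted distribution.
    mu = sum(w * d for w, d in zip(weights, depth_bins))
    # Variance: expected squared deviation from the mean.
    var = sum(w * (d - mu) ** 2 for w, d in zip(weights, depth_bins))
    return mu, math.sqrt(var)


def uncertainty_range(mu, sigma, k=1.0):
    """Uncertainty-aware depth range [mu - k*sigma, mu + k*sigma]."""
    return mu - k * sigma, mu + k * sigma


# Hypothetical pixel: three depth bins with a symmetric distribution.
depth_bins = [10.0, 20.0, 30.0]
probs = [0.25, 0.5, 0.25]

mu, sigma = depth_mean_and_std(probs, depth_bins)
near, far = uncertainty_range(mu, sigma, k=1.0)
print(f"mu={mu:.2f} m, sigma={sigma:.2f} m, range=[{near:.2f}, {far:.2f}] m")
```

The choice $k=1.0$ falls inside the 0.5 to 1.25 interval that the Figure 4 caption reports as stable; a larger $k$ widens the range and spreads features over more of the BEV plane.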