GaussianCaR: Gaussian Splatting for Efficient Camera-Radar Fusion

Santiago Montiel-Marín, Miguel Antunes-García, Fabio Sánchez-García, Angel Llamazares, Holger Caesar, Luis M. Bergasa

TL;DR

GaussianCaR targets robust BEV perception by fusing camera and radar data through Gaussian Splatting, reframing fusion as a modality→Gaussians→BEV transformation. Two modality-specific encoders (Pixels-to-Gaussians and Points-to-Gaussians) lift features into a unified Gaussian space, which is followed by a four-stage multi-scale fusion and a DPT-based BEV decoder. On nuScenes BEV segmentation, the approach reaches 57.3%, 82.9%, and 50.1% IoU for vehicles, roads, and lane dividers. It outperforms camera-only baselines and matches or surpasses rival fusion methods while running inference roughly 3.2× faster. These results suggest that Gaussian-based latent fusion is practical for scalable, real-time autonomous perception in diverse weather and traffic conditions.
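To make the modality→Gaussians→BEV idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes each Gaussian carries a 3D mean, an isotropic standard deviation, an opacity, and a single scalar feature, and it splats the Gaussians onto a BEV grid by orthographic projection (the height coordinate is simply dropped). The function name `splat_to_bev`, the grid parameters, and the additive blending are all illustrative assumptions; the paper's rasterizer may differ.

```python
import math


def splat_to_bev(gaussians, grid_size=8, extent=4.0):
    """Accumulate Gaussian features on a square BEV grid (illustrative only).

    gaussians: list of dicts with keys 'mean' (x, y, z), 'sigma',
               'opacity', and 'feature' (hypothetical parameterization).
    extent:    half-width of the BEV area in metres; the grid covers
               [-extent, extent) along both x and y.
    """
    cell = 2.0 * extent / grid_size
    bev = [[0.0] * grid_size for _ in range(grid_size)]
    for g in gaussians:
        gx, gy, _gz = g["mean"]  # orthographic projection: drop the height
        for row in range(grid_size):
            cy = -extent + (row + 0.5) * cell  # cell-centre y coordinate
            for col in range(grid_size):
                cx = -extent + (col + 0.5) * cell  # cell-centre x coordinate
                d2 = (cx - gx) ** 2 + (cy - gy) ** 2
                weight = g["opacity"] * math.exp(-0.5 * d2 / g["sigma"] ** 2)
                bev[row][col] += weight * g["feature"]
    return bev


# One Gaussian per modality, both rendered into the same BEV frame.
camera_gaussians = [
    {"mean": (0.5, 0.5, 1.2), "sigma": 0.5, "opacity": 0.9, "feature": 1.0}
]
radar_gaussians = [
    {"mean": (-2.5, 1.5, 0.3), "sigma": 0.7, "opacity": 0.6, "feature": 1.0}
]
bev_cam = splat_to_bev(camera_gaussians)
bev_rad = splat_to_bev(radar_gaussians)
```

Because both modalities are rendered into the same grid, a downstream fusion module can operate cell by cell on `bev_cam` and `bev_rad` without any further view transformation.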

Abstract

Robust and accurate perception of dynamic objects and map elements is crucial for autonomous vehicles performing safe navigation in complex traffic scenarios. While vision-only methods have become the de facto standard due to their technical advances, they can benefit from effective and cost-efficient fusion with radar measurements. In this work, we advance fusion methods by repurposing Gaussian Splatting as an efficient universal view transformer that bridges the view disparity gap, mapping both image pixels and radar points into a common Bird's-Eye View (BEV) representation. Our main contribution is GaussianCaR, an end-to-end network for BEV segmentation that, unlike prior BEV fusion methods, leverages Gaussian Splatting to map raw sensor information into latent features for efficient camera-radar fusion. Our architecture combines multi-scale fusion with a transformer decoder to efficiently extract BEV features. Experimental results demonstrate that our approach achieves performance on par with, or even surpassing, the state of the art on BEV segmentation tasks (57.3%, 82.9%, and 50.1% IoU for vehicles, roads, and lane dividers) on the nuScenes dataset, while maintaining a 3.2x faster inference runtime. Code and project page are available online.
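The IoU figures quoted above measure the overlap between predicted and ground-truth BEV masks. As a minimal sketch of the metric for a single class, assuming binary masks stored as nested lists: the function name `bev_iou` and the empty-mask convention are illustrative assumptions, and details of the nuScenes evaluation protocol are not covered here.

```python
def bev_iou(pred, target):
    """Intersection-over-union of two binary BEV masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks give an IoU of 0.0.
    return intersection / union if union else 0.0


pred = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
target = [
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
iou = bev_iou(pred, target)  # 2 overlapping cells out of 4 occupied cells
```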

Paper Structure

This paper contains 15 sections, 6 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: We propose GaussianCaR, a novel method for efficient camera-radar fusion. We envision sensor fusion as a modality$\rightarrow$Gaussians$\rightarrow$BEV transformation, achieving competitive accuracy with fast inference times for BEV segmentation tasks.
  • Figure 2: Main diagram of our proposal, GaussianCaR. Given multi-view camera images and radar point clouds, we leverage Gaussian Splatting as a universal view transformer and formulate sensor fusion as modality$\rightarrow$Gaussians$\rightarrow$BEV transformation. The model predicts BEV segmentation maps for dynamic vehicles and map elements. We employ two feature encoding branches: Pixels-to-Gaussians for camera features and Points-to-Gaussians for radar point clouds. Features are splatted and fused in BEV space using a CMX-based fuser, and decoded via a DPT decoder.
  • Figure 3: Gaussian modeling process. In (a), we present the process of extracting a Gaussian from a discrete probability distribution; in (b), we depict the behavior of the offset head, displacing the final Gaussian position from the original set of candidates; in (c), we illustrate the Gaussian rasterization process, projecting Gaussians from 3D space to BEV space via orthographic projection.
  • Figure 4: Our Pixels-to-Gaussians module extracts low-resolution feature maps using an EfficientViT backbone and a neck. A set of convolutional heads predicts $\mathcal{G}_c$ Gaussians. To position the Gaussians in 3D space, camera intrinsic and extrinsic matrices are used.
  • Figure 5: Our proposed Points-to-Gaussians module processes radar point clouds using a lightweight PTv3, composed of $\mathcal{E}$ encoder and $\mathcal{D}$ decoder blocks. A set of MLP heads then predicts $\mathcal{G}_r$ Gaussians, each parameterized by geometric and semantic attributes.
  • ...and 1 more figure
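The captions for Figures 3(a) and 4 mention extracting a Gaussian from a discrete probability distribution and positioning it in 3D with the camera intrinsic and extrinsic matrices. One plausible reading is sketched below, under loudly stated assumptions: the distribution is taken to be over depth bins along a pixel ray, the Gaussian is obtained by moment matching (mean and variance), and the camera follows a pinhole model. All function names, the calibration values, and the moment-matching choice are hypothetical; the paper may define these steps differently.

```python
def depth_distribution_to_gaussian(depths, probs):
    """Moment-match a 1D Gaussian to a discrete depth distribution (assumption)."""
    total = sum(probs)
    probs = [p / total for p in probs]  # normalize defensively
    mean = sum(d * p for d, p in zip(depths, probs))
    var = sum(p * (d - mean) ** 2 for d, p in zip(depths, probs))
    return mean, var


def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) at a given depth into the camera frame (pinhole model)."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def camera_to_ego(point, rotation, translation):
    """Apply the extrinsic transform: rotation (3x3 nested list), then translation."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical depth bins (metres) and predicted probabilities for one pixel.
depths = [2.0, 4.0, 6.0]
probs = [0.25, 0.5, 0.25]
mean_depth, var_depth = depth_distribution_to_gaussian(depths, probs)

# Hypothetical intrinsics: focal lengths and principal point in pixels.
point_cam = unproject(u=420, v=240, depth=mean_depth,
                      fx=200.0, fy=200.0, cx=320.0, cy=240.0)

# Hypothetical extrinsics: camera (x right, y down, z forward) to
# ego (x forward, y left, z up), with the camera mounted 1.5 m high.
rotation = [
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
    [0.0, -1.0, 0.0],
]
translation = (1.0, 0.0, 1.5)
gaussian_mean_ego = camera_to_ego(point_cam, rotation, translation)
```

In this sketch the depth variance `var_depth` would set the Gaussian's spread along the viewing ray, while `gaussian_mean_ego` gives its centre in the vehicle frame, ready to be projected orthographically into BEV.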