GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, Jiwen Lu

TL;DR

GaussianFormer introduces an object-centric 3D Gaussian representation that models driving scenes sparsely for vision-based 3D semantic occupancy prediction. It learns a set of Gaussians from multi-view images, each with a mean, covariance, and semantic logits, and updates them through self-encoding, image cross-attention, and property refinement. A locality-driven Gaussian-to-voxel splatting step then produces dense occupancy predictions, giving competitive accuracy at a fraction of the memory used by dense-grid methods. The approach is evaluated on nuScenes and KITTI-360, with ablations validating the effectiveness of sparse convolution and multi-stage refinement. This work offers a scalable, memory-efficient alternative to dense grid representations for 3D scene understanding in autonomous driving.

Abstract

3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene and is an important task for the robustness of vision-centric autonomous driving. Most existing methods employ dense grids such as voxels as scene representations, which ignore the sparsity of occupancy and the diversity of object scales and thus lead to unbalanced allocation of resources. To address this, we propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians where each Gaussian represents a flexible region of interest and its semantic features. We aggregate information from images through the attention mechanism and iteratively refine the properties of 3D Gaussians including position, covariance, and semantics. We then propose an efficient Gaussian-to-voxel splatting method to generate 3D occupancy predictions, which only aggregates the neighboring Gaussians for a certain position. We conduct extensive experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental results demonstrate that GaussianFormer achieves comparable performance with state-of-the-art methods with only 17.8% - 24.8% of their memory consumption. Code is available at: https://github.com/huang-yh/GaussianFormer.
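
As a rough illustration of the representation described above, the following sketch evaluates the semantic contribution of a few Gaussians at a single 3D point and sums them. It is a simplified sketch, not the released implementation: it assumes axis-aligned Gaussians (a diagonal covariance given by per-axis scales) instead of a full covariance, and the names `gaussian_contribution`, `occupancy_at`, and the dictionary layout are hypothetical.

```python
import math

def gaussian_contribution(point, mean, scale, semantics):
    """Per-class contribution of one axis-aligned Gaussian at `point`.

    Assumption: the covariance is diag(scale**2); a full covariance
    would additionally need a rotation.
    """
    # Squared Mahalanobis distance under the diagonal covariance.
    d2 = sum(((p - m) / s) ** 2 for p, m, s in zip(point, mean, scale))
    weight = math.exp(-0.5 * d2)
    return [weight * c for c in semantics]

def occupancy_at(point, gaussians):
    """Sum the contributions of all Gaussians at `point`."""
    total = [0.0] * len(gaussians[0]["semantics"])
    for g in gaussians:
        contrib = gaussian_contribution(
            point, g["mean"], g["scale"], g["semantics"]
        )
        total = [t + c for t, c in zip(total, contrib)]
    return total

# Two illustrative Gaussians with two semantic classes each.
gaussians = [
    {"mean": (0.0, 0.0, 0.0), "scale": (1.0, 1.0, 1.0), "semantics": [1.0, 0.0]},
    {"mean": (4.0, 0.0, 0.0), "scale": (0.5, 0.5, 0.5), "semantics": [0.0, 1.0]},
]
scores = occupancy_at((0.0, 0.0, 0.0), gaussians)
```

At the query point, the first Gaussian dominates and the distant second one contributes almost nothing. This is why aggregating only neighboring Gaussians loses little accuracy.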

Paper Structure (17 sections, 8 equations, 9 figures, 6 tables)

This paper contains 17 sections, 8 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Considering the universal approximating ability of Gaussian mixtures [dalal1983approximating, goodfellow2016deep], we propose an object-centric 3D semantic Gaussian representation to describe the fine-grained structure of 3D scenes without the use of dense grids. We propose a GaussianFormer model consisting of sparse convolution and cross-attention to efficiently transform 2D images into 3D Gaussian representations. To generate dense 3D occupancy, we design a Gaussian-to-voxel splatting module that can be efficiently implemented with CUDA. With comparable performance, our GaussianFormer reduces memory consumption of existing 3D occupancy prediction methods by 75.2% - 82.2%.
  • Figure 2: Comparison of the proposed 3D Gaussian representation with existing grid-based scene representations (figures from TPVFormer [huang2023tri]). The voxel representation [li2023voxformer, wei2023surroundocc] assigns a feature to each voxel in 3D space and is redundant because 3D space is sparsely occupied. BEV [li2022bevformer] and TPV [huang2023tri] employ 2D planes to describe 3D space but only alleviate the redundancy issue. In contrast, the proposed object-centric 3D Gaussian representation adapts to flexible regions of interest while still describing the fine-grained structure of the 3D scene, owing to the strong approximating ability of Gaussian mixtures [dalal1983approximating, goodfellow2016deep].
  • Figure 3: Framework of our GaussianFormer for 3D semantic occupancy prediction. We first extract multi-scale (M.S.) features from image inputs using an image backbone. We then randomly initialize a set of queries and properties (mean, covariance, and semantics) to represent 3D Gaussians and update them with interleaved self-encoding, image cross-attention, and property refinement. Having obtained the updated 3D Gaussians, we employ an efficient Gaussian-to-voxel splatting module to generate dense 3D occupancy via local aggregation of Gaussians.
  • Figure 4: Illustration of the Gaussian-to-voxel splatting method in 2D. We first voxelize the 3D Gaussians and record the affected voxels of each 3D Gaussian by appending their paired indices to a list. Then we sort the list according to the voxel indices to identify the neighboring Gaussians of each voxel, followed by a local aggregation to generate the occupancy prediction.
  • Figure 5: Visualization results for 3D semantic occupancy prediction on nuScenes. We visualize the 3D Gaussians by treating them as ellipsoids centered at the Gaussian means with semi-axes determined by the Gaussian covariance matrices. Our GaussianFormer not only achieves reasonable allocation of resources, but also captures the fine details of object shapes.
  • ...and 4 more figures
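
The splatting procedure in the Figure 4 caption (record voxel and Gaussian index pairs, sort by voxel index, then aggregate locally) can be sketched in plain Python as below. This is a minimal CPU sketch under stated assumptions, not the CUDA module the paper refers to: Gaussians are axis-aligned, a Gaussian is assumed to affect the voxels inside its 3-sigma bounding box, and the names `affected_voxels`, `splat`, `voxel_size`, and `k` are hypothetical. The example uses a 2D grid, matching the 2D illustration in the figure.

```python
import math
from itertools import groupby, product

def affected_voxels(mean, scale, voxel_size, grid_shape, k=3.0):
    """Voxel indices overlapping the k-sigma axis-aligned box of a Gaussian."""
    ranges = []
    for m, s, n in zip(mean, scale, grid_shape):
        lo = max(0, math.floor((m - k * s) / voxel_size))
        hi = min(n - 1, math.floor((m + k * s) / voxel_size))
        ranges.append(range(lo, hi + 1))  # empty if the box misses the grid
    return product(*ranges)

def splat(gaussians, voxel_size, grid_shape):
    """Return {voxel index: per-class scores} for the affected voxels only."""
    # Step 1: append one (voxel index, Gaussian index) pair per affected voxel.
    pairs = []
    for gi, g in enumerate(gaussians):
        for v in affected_voxels(g["mean"], g["scale"], voxel_size, grid_shape):
            pairs.append((v, gi))
    # Step 2: sort by voxel index so each voxel's neighbors are contiguous.
    pairs.sort()
    # Step 3: aggregate only the neighboring Gaussians of each voxel.
    occupancy = {}
    for v, group in groupby(pairs, key=lambda p: p[0]):
        center = [(i + 0.5) * voxel_size for i in v]
        scores = [0.0] * len(gaussians[0]["semantics"])
        for _, gi in group:
            g = gaussians[gi]
            d2 = sum(((c - m) / s) ** 2
                     for c, m, s in zip(center, g["mean"], g["scale"]))
            w = math.exp(-0.5 * d2)
            scores = [a + w * c for a, c in zip(scores, g["semantics"])]
        occupancy[v] = scores
    return occupancy

# Two small Gaussians on an 8 x 8 grid with unit voxels.
gaussians = [
    {"mean": (1.5, 1.5), "scale": (0.3, 0.3), "semantics": [1.0, 0.0]},
    {"mean": (6.5, 6.5), "scale": (0.3, 0.3), "semantics": [0.0, 1.0]},
]
occ = splat(gaussians, voxel_size=1.0, grid_shape=(8, 8))
```

Only the voxels near a Gaussian appear in the result, so the cost scales with the number of pairs and not with the full grid size. Voxels absent from `occ` can be treated as empty.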