ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding

Lingjun Zhao, Yandong Luo, James Hay, Lu Gan

TL;DR

ShelfGaussian introduces open-vocabulary 3D scene understanding by representing scenes with sparse 3D Gaussians and training them with multi-modal signals from cameras, LiDAR, and radar, all supervised by off-the-shelf vision foundation models. A novel Multi-Modal Gaussian Transformer aggregates multi-sensor features to refine Gaussian parameters, while shelf-supervised learning aligns Gaussian renderings at 2D and 3D levels using a DINO-driven pseudo labeling engine and a CUDA-accelerated Gaussian-to-Voxel splatting module. The approach achieves state-of-the-art zero-shot semantic occupancy on Occ3D-nuScenes, competitive BEV segmentation, and improved Gaussian-based trajectory planning, validated in an in-the-wild UGV setting. The work demonstrates strong open-vocabulary capabilities, cross-modal fusion, and practical benefits for perception and planning in real-world robotics and autonomous systems.

Abstract

We introduce ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs). Gaussian-based methods have demonstrated superior performance and computational efficiency across a wide range of scene understanding tasks. However, existing methods either model objects as closed-set semantic Gaussians supervised by annotated 3D labels, neglecting their rendering ability, or learn open-set Gaussian representations via purely 2D self-supervision, which degrades geometry and limits them to camera-only settings. To fully exploit the potential of Gaussians, we propose a Multi-Modal Gaussian Transformer that enables Gaussians to query features from diverse sensor modalities, and a Shelf-Supervised Learning Paradigm that efficiently optimizes Gaussians with VFM features jointly at the 2D image and 3D scene levels. We evaluate ShelfGaussian on various perception and planning tasks. Experiments on Occ3D-nuScenes demonstrate its state-of-the-art zero-shot semantic occupancy prediction performance. ShelfGaussian is further evaluated on an unmanned ground vehicle (UGV) to assess its in-the-wild performance across diverse urban scenarios. Project website: https://lunarlab-gatech.github.io/ShelfGaussian/.

Paper Structure

This paper contains 44 sections, 29 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: We propose ShelfGaussian for Gaussian-based 3D scene understanding in open-vocabulary, multi-modal, and multi-task scenarios. (a) Our model can assist a robot in predicting open-set occupancy from any sensor modality with the help of VFMs. (b) Compared to existing Gaussian-based methods, ours provides a generalizable solution for 3D scene understanding.
  • Figure 2: Overview of ShelfGaussian. ShelfGaussian employs off-the-shelf VFMs to extract depth and DINO feature maps from multi-view images, and trains LiDAR and radar backbones to extract related features. These are then fed into our multi-modal Gaussian transformer to predict sparse sets of 3D Gaussians to represent the scene. During training, Gaussians are rendered into camera views for VFM-based 2D supervision, while being converted into voxels via our CUDA-accelerated G2V module for 3D supervision. The shelf-supervised Gaussians support zero-shot semantic occupancy prediction, BEV segmentation, and are further evaluated for trajectory planning.
  • Figure 3: Overview of the DINO-Driven Pseudo Labeling Engine. We teleoperate our UGV through urban scenarios to collect paired image and point cloud sequences, along with trajectories, from the onboard camera and LiDAR. LiDAR points are then projected onto the image and decorated with pixel-wise DINO features. These points are aggregated and voxelized at a customized resolution to form 3D pseudo labels.
  • Figure 4: Dual-CSR Structure for CUDA-Accelerated Gaussian2Voxel. Gaussian$\rightarrow$Tile CSR: index pointers store tile offsets per Gaussian, indices record tile IDs, and values store Gaussian IDs. Tile$\rightarrow$Gaussian CSR: index pointers store Gaussian offsets per tile, and indices record Gaussian IDs obtained by sorting and run-length encoding (RLE) tile-Gaussian pairs.
  • Figure 5: Qualitative results of ShelfGaussian on the nuScenes dataset. The figure shows the predicted semantic occupancy queried by the semantic classes in \ref{tab:occ3d}, ground-truth labels from Occ3D \cite{occ3d}, and occupancy of open-set queries from the ShelfGaussian-LCR model. Best viewed on screen; the color bar is given in \ref{tab:occ3d}.
  • ...and 6 more figures
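
The dual-CSR layout described in the Figure 4 caption can be illustrated with a small CPU-side sketch. This is not the authors' CUDA kernel: the function name `build_dual_csr`, the input format (a list of tile IDs per Gaussian), and the variable names are assumptions made for illustration. It only shows how a Gaussian→Tile CSR can be inverted into a Tile→Gaussian CSR by sorting tile-Gaussian pairs and run-length encoding the tile IDs.

```python
from itertools import accumulate, groupby


def build_dual_csr(gaussian_tiles, num_tiles):
    """Build both CSR structures from a hypothetical overlap list.

    gaussian_tiles[g] is the list of tile IDs that Gaussian g overlaps.
    """
    # Gaussian -> Tile CSR: indptr holds tile offsets per Gaussian,
    # indices hold tile IDs, values hold the owning Gaussian ID.
    g2t_indptr = [0] + list(accumulate(len(tiles) for tiles in gaussian_tiles))
    g2t_indices = [t for tiles in gaussian_tiles for t in tiles]
    g2t_values = [g for g, tiles in enumerate(gaussian_tiles) for _ in tiles]

    # Tile -> Gaussian CSR: sort (tile, gaussian) pairs by tile ID, then
    # run-length encode the tile IDs to get the number of Gaussians per tile.
    pairs = sorted(zip(g2t_indices, g2t_values))
    counts = [0] * num_tiles
    for tile, run in groupby(pairs, key=lambda pair: pair[0]):
        counts[tile] = sum(1 for _ in run)
    t2g_indptr = [0] + list(accumulate(counts))
    t2g_indices = [g for _, g in pairs]

    return (g2t_indptr, g2t_indices, g2t_values), (t2g_indptr, t2g_indices)


# Toy example: 3 Gaussians over 4 tiles (tile 3 is empty).
g2t, t2g = build_dual_csr([[0, 1], [1], [1, 2]], num_tiles=4)
t2g_indptr, t2g_indices = t2g
# Gaussians touching tile 1 are a contiguous slice of the indices array.
tile1_gaussians = t2g_indices[t2g_indptr[1]:t2g_indptr[2]]
```

With this layout, each tile can read its Gaussians as one contiguous slice, which is the property a per-tile parallel kernel would typically rely on.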
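
The pseudo-labeling steps in the Figure 3 caption (project LiDAR points onto the image, attach pixel-wise features, voxelize) can be sketched as follows. This is a simplified single-frame illustration under several assumptions that are not from the paper: points are already in the camera frame, the camera is an ideal pinhole with hypothetical intrinsics `fx, fy, cx, cy`, multi-frame aggregation along the trajectory is omitted, and features falling in the same voxel are averaged. The function name `pseudo_label_voxels` is likewise hypothetical.

```python
from math import floor


def pseudo_label_voxels(points_cam, feature_map, fx, fy, cx, cy, voxel_size):
    """Return {voxel_index: mean feature} for points visible in the image.

    points_cam: iterable of (x, y, z) in the camera frame, z pointing forward.
    feature_map: feature_map[v][u] is the feature vector (list) at pixel (u, v).
    """
    height, width = len(feature_map), len(feature_map[0])
    sums, counts = {}, {}
    for x, y, z in points_cam:
        if z <= 0:
            continue  # point is behind the camera
        # Pinhole projection to pixel coordinates.
        u = floor(fx * x / z + cx)
        v = floor(fy * y / z + cy)
        if not (0 <= u < width and 0 <= v < height):
            continue  # point projects outside the image
        feat = feature_map[v][u]
        # Voxel index at the chosen resolution (assumed: camera-frame grid).
        key = (floor(x / voxel_size), floor(y / voxel_size), floor(z / voxel_size))
        if key not in sums:
            sums[key] = [0.0] * len(feat)
            counts[key] = 0
        sums[key] = [s + f for s, f in zip(sums[key], feat)]
        counts[key] += 1
    # Assumed aggregation rule: average all features inside each voxel.
    return {k: [s / counts[k] for s in sums[k]] for k in sums}


# Toy example: a 1x2 feature map and four points, two of which are discarded.
features = [[[1.0, 0.0], [0.0, 1.0]]]
points = [(0.1, 0.1, 1.0), (0.9, 0.1, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
labels = pseudo_label_voxels(points, features, fx=1.0, fy=1.0, cx=0.5, cy=0.5, voxel_size=1.0)
```

In the toy example the first two points hit different pixels but share a voxel, so their features are averaged; the third point is behind the camera and the fourth projects outside the image.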