VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction

Xiaoyang Yan; Muleilan Pei; Shaojie Shen

TL;DR

This work introduces Visual Geometry Grounded Gaussian Splatting (VG3S), a framework that equips Gaussian-based occupancy prediction with cross-view 3D geometric grounding. Its core component is a plug-and-play hierarchical geometric feature adapter that transforms generic tokens from a frozen Vision Foundation Model (VFM) through feature aggregation, task-specific alignment, and multi-scale restructuring.

Abstract

3D semantic occupancy prediction has become a crucial perception task for comprehensive scene understanding in autonomous driving. While recent advances have explored 3D Gaussian splatting for occupancy modeling to substantially reduce computational overhead, the generation of high-quality 3D Gaussians relies heavily on accurate geometric cues, which are often insufficient in purely vision-centric paradigms. To bridge this gap, we advocate for injecting the strong geometric grounding capability from Vision Foundation Models (VFMs) into occupancy prediction. In this regard, we introduce Visual Geometry Grounded Gaussian Splatting (VG3S), a novel framework that empowers Gaussian-based occupancy prediction with cross-view 3D geometric grounding. Specifically, to fully exploit the rich 3D geometric priors from a frozen VFM, we propose a plug-and-play hierarchical geometric feature adapter, which can effectively transform generic VFM tokens via feature aggregation, task-specific alignment, and multi-scale restructuring. Extensive experiments on the nuScenes occupancy benchmark demonstrate that VG3S achieves remarkable improvements of 12.6% in IoU and 7.5% in mIoU over the baseline. Furthermore, we show that VG3S generalizes seamlessly across diverse VFMs, consistently enhancing occupancy prediction accuracy and firmly underscoring the immense value of integrating priors derived from powerful, pre-trained geometry-grounded VFMs.
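The abstract reports gains in IoU and mIoU. As a reminder of how these two metrics are commonly computed on voxel grids, here is a minimal sketch in plain Python. It is not the paper's evaluation code: the function names, the `empty_label` parameter, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration. IoU treats every non-empty voxel as "occupied" (geometry only), while mIoU averages per-class IoU over the semantic classes.

```python
def occupancy_iou(pred, gt, empty_label=0):
    """Geometric IoU: any voxel whose label is not `empty_label` counts as occupied.

    `pred` and `gt` are flat sequences of integer voxel labels of equal length.
    `empty_label` is a hypothetical convention; benchmarks differ on its value.
    """
    inter = sum(1 for p, g in zip(pred, gt) if p != empty_label and g != empty_label)
    union = sum(1 for p, g in zip(pred, gt) if p != empty_label or g != empty_label)
    return inter / union if union else float("nan")


def mean_iou(pred, gt, class_ids):
    """Semantic mIoU: per-class IoU averaged over `class_ids`.

    Classes that appear in neither `pred` nor `gt` are skipped here (an assumption;
    some evaluation scripts count them differently).
    """
    ious = []
    for c in class_ids:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: six voxels, label 0 = empty, labels 1 and 2 = semantic classes.
pred = [0, 1, 1, 2, 2, 0]
gt = [0, 1, 2, 2, 0, 0]
print(occupancy_iou(pred, gt))
print(mean_iou(pred, gt, class_ids=[1, 2]))
```

In the toy example, three voxels are occupied in both grids and four in either, so the geometric IoU is 3/4, whereas the per-class IoUs are 1/2 and 1/3. This shows why a method can score differently on the two metrics: IoU rewards getting the occupied shape right, mIoU additionally requires the right class label.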

Paper Structure (29 sections, 15 equations, 3 figures, 5 tables)

Introduction
Related Work
Scene-Centric Semantic Occupancy Prediction
Gaussian-Based 3D Scene Modeling
Geometry-Grounded Visual Foundation Models
Methodology
Problem Formulation
Framework Overview
Geometry-Grounded VFM Feature Extraction
Hierarchical Geometric Feature Adapter
Grouped Adaptive Token Fusion
Task-Aligned Token Refinement
Latent Spatial Feature Pyramid
Gaussian-to-Voxel Splatting
Training Objective
...and 14 more sections

Figures (3)

Figure 1: Comparison between existing Gaussian-based methods and our proposed VG3S. Existing approaches often produce semantic occupancy with incomplete object coverage due to the lack of accurate 3D geometric priors. In contrast, our VG3S incorporates rich 3D geometric priors embedded in a frozen VFM pre-trained on massive datasets, enabling the decoder to generate more geometrically accurate and consistent semantic occupancy predictions.
Figure 2: Framework overview of VG3S. Our approach leverages a powerful, pre-trained frozen VFM to provide rich 3D geometric priors, empowering the downstream Gaussian-based decoder with cross-view 3D geometric grounding and thereby significantly improving 3D semantic occupancy prediction.
Figure 3: Qualitative comparison between the baseline GaussianFormer-2 [huang2025gaussianformer2] and our proposed VG3S. Our approach produces more geometrically accurate and consistent object structures across four challenging scenes compared to the baseline, demonstrating that leveraging strong 3D geometric priors embedded within VFMs significantly improves 3D semantic occupancy predictions.
