Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction

Changqing Zhou, Yueru Luo, Changhao Chen

TL;DR

GPOcc is a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. It extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference.

Abstract

Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65$\times$ faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released at https://github.com/JuIvyy/GPOcc.
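
The idea of extending surface points inward along camera rays can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_along_ray`, the fixed `step` spacing, and the fixed `num_samples` are all assumptions made for illustration, and the abstract does not state how sample depths are chosen.

```python
import math


def sample_along_ray(surface_point, camera_center=(0.0, 0.0, 0.0),
                     num_samples=3, step=0.1):
    """Return points that start at a visible surface point and continue
    along the camera ray, away from the camera (i.e. into the object).

    Hypothetical helper: uniform spacing is an illustrative assumption.
    """
    # Ray direction from the camera center through the surface point.
    direction = [s - c for s, c in zip(surface_point, camera_center)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("surface point coincides with the camera center")
    unit = [d / norm for d in direction]

    # k = 0 is the surface point itself; k > 0 lies behind the surface.
    return [
        tuple(s + k * step * u for s, u in zip(surface_point, unit))
        for k in range(num_samples)
    ]


if __name__ == "__main__":
    for point in sample_along_ray((0.0, 0.0, 2.0), num_samples=3, step=0.5):
        print(point)
```

In this sketch each returned point would serve as a candidate Gaussian center, so samples sit on or behind the observed surface rather than in the free space in front of it.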

Paper Structure (17 sections, 10 equations, 5 figures, 5 tables)

This paper contains 17 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of monocular occupancy prediction pipelines. ISO formulates depth estimation as a multi-class classification problem, using the predicted depth distributions to lift 2D image features into dense 3D volumes, which are then processed by a 3D U-Net for occupancy prediction. EmbodiedOcc, by contrast, initializes random 3D anchors and applies cross-attention to aggregate image features, predicting Gaussian primitives that are splatted into voxels. Many of these Gaussians fall in empty regions, shown as gray primitives. GPOcc instead employs ray-based volumetric sampling to generate sparse Gaussians concentrated on or within objects, producing a compact and efficient representation for occupancy inference.
  • Figure 2: Overview of GPOcc. Given an input RGB image, visual geometry priors (GPs) predict surface points and extract 3D-aware features. These surface points guide ray-based volumetric sampling to estimate interior points, which serve as Gaussian centers. The extracted features are combined with learnable embeddings to predict Gaussian attributes, and the resulting primitives are splatted to infer occupancy probabilistically. Monocular predictions are incrementally integrated into a global memory bank, enabling coherent large-scale occupancy construction.
  • Figure 3: Comparison of Gaussian representations. (a) EmbodiedOcc, where gray indicates Gaussians predicted as empty. A substantial portion of the primitives is placed in empty space, resulting in an inefficient representation. (b) Our method, which is much more compact, with Gaussians concentrated in occupied regions.
  • Figure 4: Qualitative comparison on monocular occupancy prediction. (a) shows the input RGB images, (b) the ground-truth occupancy, (c) the predictions of EmbodiedOcc, (d) the predictions of our method, and (e) the visualization of the Gaussian primitives predicted by our method. Compared to EmbodiedOcc, our framework produces more accurate and complete reconstructions, while the Gaussian representation provides interpretable intermediate geometry.
  • Figure 5: Qualitative results on streaming inputs. Our incremental update strategy progressively integrates information from sequential frames. The predictions become increasingly complete as more frames are observed, demonstrating the effectiveness of our streaming design.
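
The captions above describe Gaussian primitives being splatted to infer occupancy probabilistically. One common way to do this can be sketched as follows, under explicit assumptions: Gaussians are isotropic, each is a hypothetical `(center, sigma, opacity)` tuple, and a point is occupied if at least one Gaussian occupies it. The aggregation rule used by GPOcc is not given in this section, so this is an illustration rather than the paper's formulation.

```python
import math


def occupancy_probability(query, gaussians):
    """Probability that `query` is occupied, given isotropic Gaussians.

    `gaussians` is a list of (center, sigma, opacity) tuples; this layout
    and the "at least one Gaussian occupies the point" rule are assumptions.
    """
    p_empty = 1.0
    for center, sigma, opacity in gaussians:
        sq_dist = sum((q - c) ** 2 for q, c in zip(query, center))
        # Contribution of one Gaussian, scaled by its opacity.
        p_i = opacity * math.exp(-0.5 * sq_dist / sigma ** 2)
        # The point stays empty only if every Gaussian leaves it empty.
        p_empty *= 1.0 - p_i
    return 1.0 - p_empty


if __name__ == "__main__":
    scene = [((0.0, 0.0, 2.0), 0.2, 0.8), ((0.0, 0.0, 2.3), 0.2, 0.6)]
    print(occupancy_probability((0.0, 0.0, 2.0), scene))   # near a center
    print(occupancy_probability((1.5, 1.5, 0.5), scene))   # far from both
```

Because each Gaussian's contribution decays quickly with distance, primitives concentrated on or within objects affect only nearby voxels, which is consistent with the compactness argument made in Figures 1 and 3.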