
Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction

Chi Yan, Dan Xu

TL;DR

Open-vocabulary 3D occupancy prediction struggles with balancing detail and efficiency when using text-aligned features. PG-Occ introduces a Progressive Gaussian Transformer that models scenes as text-informed Gaussian blobs and densifies them online in a coarse-to-fine manner, augmented by anisotropy-aware sampling and asymmetric self-attention to stabilize learning. Trained with purely 2D supervision through rasterization, it achieves state-of-the-art results on Occ3D-nuScenes (relative $14.3\%$ mIoU gain) and strong retrieval performance on nuScenes, while enabling zero-shot semantic occupancy via CLIP prompts. The approach offers a practical, efficient path to open-vocabulary 3D scene understanding suitable for autonomous driving and beyond.
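
The zero-shot step mentioned above (labelling occupancy from text prompts) can be illustrated with a minimal sketch. It assumes each occupied voxel carries a feature vector living in the same embedding space as the text prompts, and it assigns the prompt with the highest cosine similarity. The function names, the 3-dimensional toy vectors, and the prompt list are hypothetical placeholders; a real system would obtain the prompt embeddings from a CLIP text encoder, and the paper's actual decoding procedure may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def assign_labels(voxel_features, prompt_embeddings):
    """For each voxel feature, pick the prompt whose embedding is most similar."""
    labels = []
    for feat in voxel_features:
        best = max(prompt_embeddings,
                   key=lambda name: cosine(feat, prompt_embeddings[name]))
        labels.append(best)
    return labels

# Toy stand-ins for text embeddings (assumption: NOT real CLIP outputs).
prompts = {
    "car": [1.0, 0.0, 0.0],
    "pedestrian": [0.0, 1.0, 0.0],
    "vegetation": [0.0, 0.0, 1.0],
}
# Toy stand-ins for text-aligned voxel features.
voxels = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8], [0.2, 0.7, 0.1]]

labels = assign_labels(voxels, prompts)
print(labels)
```

Because the label set is just the dictionary of prompts, changing the vocabulary at inference time only requires embedding new text, not retraining, which is the property "open-vocabulary" refers to.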

Abstract

The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present PG-Occ, an innovative Progressive Gaussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that PG-Occ achieves state-of-the-art performance with a relative 14.3% mIoU improvement over the previous best performing method. Code and pretrained models will be released upon publication on our project page: https://yanchi-3dv.github.io/PG-Occ
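
To make the idea of an anisotropy-dependent receptive field concrete, here is a small sketch under stated assumptions. It treats each Gaussian as a mean, three per-axis scales, and a rotation matrix, and places sample points along the principal axes at a distance proportional to each scale, so an elongated Gaussian is sampled over a correspondingly elongated region. The sampling pattern (centre plus two points per axis), the factor `k`, and all function names are illustrative choices, not the sampling rule used in PG-Occ.

```python
import math

def rotation_z(yaw):
    """3x3 rotation matrix about the z-axis (row-major nested lists)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def anisotropic_sample_points(mean, scales, rotation, k=1.0):
    """Return the centre plus two points per principal axis at +/- k * scale.

    Column `axis` of `rotation` is the world-space direction of that
    principal axis, so longer axes yield more distant sample points.
    """
    points = [tuple(mean)]
    for axis in range(3):
        direction = [rotation[row][axis] for row in range(3)]
        for sign in (1.0, -1.0):
            offset = sign * k * scales[axis]
            points.append(tuple(m + offset * d for m, d in zip(mean, direction)))
    return points

# An elongated, flat Gaussian: long along x, thin along z (hypothetical values).
pts = anisotropic_sample_points(
    mean=(0.0, 0.0, 0.0),
    scales=(2.0, 0.5, 0.1),
    rotation=rotation_z(0.0),
)
for p in pts:
    print(p)
```

In a full pipeline, points like these would be projected into the multi-camera images (across several time steps for spatio-temporal fusion) and the image features at the projected locations would be aggregated; that projection step is omitted here.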

Paper Structure

This paper contains 33 sections, 18 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Overview of the proposed PG-Occ framework. The radar chart compares occupancy prediction accuracy across multiple methods, showing the superior performance of PG-Occ. The central panel highlights the key components: progressive Gaussian modeling with online feed-forward densification, anisotropy-aware sampling with adaptive receptive fields, and open-vocabulary retrieval conditioned on prompt inputs. The bottom row illustrates an example progression from the current input view through successive densification stages to the final occupancy prediction.
  • Figure 2: Architecture of the proposed PG-Occ framework. The scene is represented as feature Gaussian blobs, starting from a base layer and progressively refined and densified through $B$ layers. Multi-camera inputs are processed to extract spatio-temporal features that guide the update and refinement of the Gaussians. The Gaussians are then voxelized to produce an any-resolution 3D occupancy field, enabling both geometric reconstruction and open-vocabulary semantic understanding.
  • Figure 3: Illustration of the Progressive Online Densification (POD) and Anisotropy-aware Feature Sampling (AFS) modules. POD leverages depth-aware densification to progressively add and refine 3D Gaussians. AFS exploits the anisotropic properties of Gaussians, sampling feature points within anisotropy-aware receptive fields to enable more effective spatio-temporal feature extraction.
  • Figure 4: Illustration of PG-Occ predictions. Given camera inputs and text prompts, the method predicts depth (column 2), produces open-vocabulary semantic labels (column 3), and generates the final semantic occupancy map (column 4). Additional visualizations are provided in the paper's additional visualization section.
  • Figure 5: Depth estimation error metrics on the nuScenes validation set. The best results are denoted in bold. Abs Rel is used as the primary evaluation metric.
  • ...and 5 more figures
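
The Figure 2 caption describes Gaussians being voxelized into an occupancy field at any resolution. A minimal sketch of that idea follows, with several simplifying assumptions that are not taken from the paper: Gaussians are axis-aligned (no rotation), the density at a voxel centre is the opacity-weighted sum of Gaussian values, and a voxel is marked occupied when that sum reaches a fixed threshold. The dictionary layout, function names, and threshold are hypothetical.

```python
import math

def gaussian_density(point, mean, scales, opacity):
    """Opacity-weighted value of an axis-aligned 3D Gaussian at `point`."""
    d2 = sum(((p - m) / s) ** 2 for p, m, s in zip(point, mean, scales))
    return opacity * math.exp(-0.5 * d2)

def voxelize(gaussians, origin, voxel_size, grid_shape, threshold=0.5):
    """Return the set of (i, j, k) voxel indices whose centre density >= threshold."""
    nx, ny, nz = grid_shape
    occupied = set()
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                centre = tuple(o + (idx + 0.5) * voxel_size
                               for o, idx in zip(origin, (i, j, k)))
                density = sum(
                    gaussian_density(centre, g["mean"], g["scales"], g["opacity"])
                    for g in gaussians
                )
                if density >= threshold:
                    occupied.add((i, j, k))
    return occupied

# One toy Gaussian centred in the first 1 m voxel (hypothetical values).
gaussians = [{"mean": (0.5, 0.5, 0.5), "scales": (0.5, 0.5, 0.5), "opacity": 1.0}]

# The same Gaussians can be queried on grids of different resolution.
coarse = voxelize(gaussians, origin=(0.0, 0.0, 0.0), voxel_size=1.0, grid_shape=(2, 2, 2))
fine = voxelize(gaussians, origin=(0.0, 0.0, 0.0), voxel_size=0.5, grid_shape=(4, 4, 4))
print(len(coarse), len(fine))
```

Since the Gaussians form a continuous representation, the voxel size is a query-time parameter rather than something fixed by the model, which is one way to read "any-resolution" in the caption.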