GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

Haoyi Jiang, Liu Liu, Tianheng Cheng, Xinjie Wang, Tianwei Lin, Zhizhong Su, Wenyu Liu, Xinggang Wang

TL;DR

GaussTR introduces a sparse Gaussian Transformer that represents 3D scenes as sets of learnable Gaussians and aligns these representations with vision-language foundation models through differentiable Gaussian splatting. The approach enables self-supervised 3D spatial understanding with open-vocabulary occupancy prediction, achieving state-of-the-art zero-shot performance on Occ3D-nuScenes (12.27 mIoU) while reducing training time by about 40%. Key innovations include deformable cross-attention for Gaussian queries, global self-attention across Gaussians, PCA-based feature compression, and a loss stack integrating feature, depth, and optional segmentation supervision. By bridging 3D Gaussian modeling with foundation-model priors, GaussTR facilitates scalable, generalizable 3D perception suitable for autonomous driving and embodied agents, with robust object-centric performance and promising open-vocabulary capabilities.

Abstract

3D Semantic Occupancy Prediction is fundamental for spatial understanding, yet existing approaches face challenges in scalability and generalization due to their reliance on extensive labeled data and computationally intensive voxel-wise representations. In this paper, we introduce GaussTR, a novel Gaussian-based Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding. GaussTR predicts sparse sets of Gaussians in a feed-forward manner to represent 3D scenes. By splatting the Gaussians into 2D views and aligning the rendered features with foundation models, GaussTR facilitates self-supervised 3D representation learning and enables open-vocabulary semantic occupancy prediction without requiring explicit annotations. Empirical experiments on the Occ3D-nuScenes dataset demonstrate GaussTR's state-of-the-art zero-shot performance of 12.27 mIoU, along with a 40% reduction in training time. These results highlight the efficacy of GaussTR for scalable and holistic 3D spatial understanding, with promising implications in autonomous driving and embodied agents. The code is available at https://github.com/hustvl/GaussTR.

Paper Structure

This paper contains 19 sections, 15 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparative performance of self-supervised 3D occupancy prediction methods. GaussTR achieves a 2.33 mIoU (23%) improvement over other counterparts while reducing training time by approximately 40%, using merely 3% of the scene representation parameters (e.g., voxels or Gaussians). Marker sizes are proportional to the logarithm of scene representation parameters.
  • Figure 2: Architectural overview of the GaussTR framework. GaussTR begins by extracting multi-view features with pre-trained foundation models. A series of Transformer layers then predicts a sparse set of Gaussian queries to represent the 3D scene. During training, the predicted Gaussians are rendered into the source 2D views via differentiable splatting, enforcing alignment with 2D depth and features from foundation models. At inference, Gaussian features are converted into semantic logits by measuring their similarity with text-embedded category vectors, followed by voxelization to produce volumetric predictions.
  • Figure 3: Qualitative visualizations of GaussTR on Occ3D-nuScenes. GaussTR consistently produces both coherent global scene structures and fine-grained local details, offering a comprehensive understanding of the environment. Notably, it excels at modeling object-centric categories, such as cars and buildings.
  • Figure 4: Visualizations of rendered views. The figure illustrates the rendered depth and segmentation maps of Gaussian predictions derived from camera views. Moreover, activation maps for novel categories are visualized, highlighted in red boxes.
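
The inference path described in the Figure 2 caption (similarity between Gaussian features and text-embedded category vectors, followed by voxelization) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `temperature` value, and the nearest-voxel assignment by Gaussian center are all assumptions made for illustration. In particular, the sketch ignores each Gaussian's scale, rotation, and opacity, which a faithful voxelization would likely take into account.

```python
import numpy as np


def semantic_logits(gauss_feats, text_embeds, temperature=0.07):
    """Cosine similarity between Gaussian features and category text embeddings.

    gauss_feats: (num_gaussians, dim), text_embeds: (num_classes, dim).
    The temperature is a hypothetical scaling factor, not taken from the paper.
    """
    g = gauss_feats / np.linalg.norm(gauss_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (g @ t.T) / temperature  # (num_gaussians, num_classes)


def voxelize(means, logits, grid_min, voxel_size, grid_shape, empty_label=-1):
    """Simplified voxelization: sum the logits of all Gaussians whose centers
    fall inside a voxel, then take the argmax class per occupied voxel."""
    grid_shape = tuple(grid_shape)
    num_voxels = int(np.prod(grid_shape))

    # Integer voxel index of every Gaussian center.
    idx = np.floor((means - grid_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, logits = idx[inside], logits[inside]

    # Accumulate logits per voxel (np.add.at handles repeated indices).
    flat = np.ravel_multi_index(tuple(idx.T), grid_shape)
    acc = np.zeros((num_voxels, logits.shape[1]))
    np.add.at(acc, flat, logits)

    # Voxels that received no Gaussian keep the empty label.
    hit = np.zeros(num_voxels, dtype=bool)
    hit[flat] = True
    labels = np.full(num_voxels, empty_label)
    labels[hit] = acc[hit].argmax(axis=1)
    return labels.reshape(grid_shape)


# Toy usage: three Gaussians, two categories, a 2x2x2 grid of 1 m voxels.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
means = np.array([[0.5, 0.5, 0.5], [1.5, 0.5, 0.5], [5.0, 5.0, 5.0]])
occupancy = voxelize(means, semantic_logits(feats, texts),
                     grid_min=np.zeros(3), voxel_size=1.0, grid_shape=(2, 2, 2))
```

Because the category set enters only through `text_embeds`, swapping in embeddings for a different list of class names changes the predicted labels without retraining, which is the sense in which the prediction is open-vocabulary.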