TGP: Two-modal occupancy prediction with 3D Gaussian and sparse points for 3D Environment Awareness

Mu Chen, Wenyu Chen, Mingchuan Yang, Yuan Zhang, Tao Han, Xinchi Li, Yunlong Li, Huaici Zhao

TL;DR

3D occupancy prediction is improved by balancing volumetric structure with precise localization of geometry and semantics. The paper proposes TGP, a dual-modal framework that fuses 3D Gaussian representations with sparse points in a Transformer-based decoder with adaptive fusion and per-layer refinement. Gaussian attributes (mean, scale, rotation, semantics) enable flexible volumetric regions, integrated via Gaussian-to-voxel splatting and a multi-layer fusion strategy. Training uses a weighted Chamfer distance plus focal semantic loss across decoder layers, and experiments on Occ3D-nuScenes show superior IoU-based metrics (mIoU and RayIoU) compared with strong baselines, validating improved semantic occupancy accuracy with a modest speed trade-off.
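To make the Gaussian-to-voxel step concrete, the sketch below shows one plausible way to turn Gaussian attributes (mean, scale, rotation, semantics) into a dense semantic grid. It is an illustration under stated assumptions, not the paper's implementation. The function names, the (w, x, y, z) quaternion order, the brute-force loop over every voxel, and the unnormalized density weighting are all assumptions made here; a practical system would likely restrict each Gaussian to a local neighborhood and run on the GPU.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion in (w, x, y, z) order (assumed) to a 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def splat_gaussians_to_voxels(means, scales, quats, sem_logits,
                              grid_min, voxel_size, grid_shape):
    """Hypothetical helper: accumulate Gaussian semantics into a voxel grid.

    means (N, 3), scales (N, 3), quats (N, 4), sem_logits (N, C).
    Returns an array of shape grid_shape + (C,) holding density-weighted scores.
    """
    # Coordinates of every voxel center, shape (X, Y, Z, 3).
    idx = np.stack(
        np.meshgrid(*[np.arange(s) for s in grid_shape], indexing="ij"), axis=-1
    )
    centers = np.asarray(grid_min, dtype=float) + (idx + 0.5) * voxel_size
    flat = centers.reshape(-1, 3)

    out = np.zeros((flat.shape[0], sem_logits.shape[1]))
    for mu, s, q, c in zip(means, scales, quats, sem_logits):
        rot = quat_to_rotmat(q)
        # Covariance from rotation and per-axis scale: R diag(s^2) R^T.
        cov = rot @ np.diag(np.asarray(s, dtype=float) ** 2) @ rot.T
        d = flat - mu
        # Squared Mahalanobis distance from each voxel center to the Gaussian.
        maha = np.einsum("ni,ij,nj->n", d, np.linalg.inv(cov), d)
        weight = np.exp(-0.5 * maha)
        # Each Gaussian contributes its semantics, weighted by its density.
        out += weight[:, None] * c
    return out.reshape(*grid_shape, -1)
```

Because the scale and rotation shape the covariance, an elongated, rotated Gaussian covers an anisotropic region of the grid, which is the "flexible volumetric region" the summary refers to.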

Abstract

3D semantic occupancy has rapidly become a research focus in robotics and autonomous-driving environment perception because it provides more realistic geometric perception and integrates more closely with downstream tasks. Predicting the occupancy of the surrounding 3D space improves both the capability and the robustness of scene understanding. However, existing occupancy prediction methods are primarily voxel-based or point cloud-based: voxel-based networks often lose spatial information during voxelization, while point cloud-based methods, although better at retaining spatial location information, are limited in representing volumetric structural details. To address this issue, we propose a dual-modal prediction method based on 3D Gaussian sets and sparse points, which balances spatial location and volumetric structural information and achieves higher accuracy in semantic occupancy prediction. Specifically, our method adopts a Transformer-based architecture that takes 3D Gaussian sets, sparse points, and queries as inputs. Through the multi-layer structure of the Transformer, the enhanced queries and 3D Gaussian sets jointly contribute to the semantic occupancy prediction, and an adaptive fusion mechanism integrates the semantic outputs of both modalities to generate the final prediction. Additionally, to further improve accuracy, we dynamically refine the point cloud at each layer, providing more precise location information during occupancy prediction. We conducted experiments on the Occ3D-nuScenes dataset, and the results demonstrate superior performance of the proposed method on IoU-based metrics.
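The abstract does not spell out how the adaptive fusion mechanism combines the two modalities. As a minimal sketch, one common choice is a learned per-voxel gate that blends the point-branch and Gaussian-branch semantic logits. Everything named below (`adaptive_fuse`, the gate parameters `w` and `b`, the sigmoid gate over concatenated logits) is a hypothetical illustration and should not be read as the authors' design.

```python
import numpy as np

def sigmoid(x):
    """Logistic function mapping real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fuse(point_logits, gauss_logits, w, b):
    """Hypothetical gated fusion of two per-voxel semantic predictions.

    point_logits, gauss_logits: (V, C) logits from the two branches.
    w: (2C,) gate weights, b: scalar gate bias (learned in practice).
    Returns the fused logits (V, C) and the gate values (V, 1).
    """
    # The gate sees both predictions so it can weigh them per voxel.
    feats = np.concatenate([point_logits, gauss_logits], axis=-1)  # (V, 2C)
    gate = sigmoid(feats @ w + b)[:, None]                         # (V, 1)
    # Convex combination: gate near 1 trusts the point branch,
    # gate near 0 trusts the Gaussian branch.
    fused = gate * point_logits + (1.0 - gate) * gauss_logits
    return fused, gate
```

With zero gate weights and zero bias the gate is 0.5 everywhere and the fusion reduces to a plain average; training the gate lets the model lean on the point branch where localization matters and on the Gaussian branch where volumetric extent matters.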

Paper Structure

This paper contains 17 sections, 5 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Framework of the proposed occupancy prediction pipeline. The pipeline follows the Transformer paradigm and takes initial point positions, 3D Gaussian representations, queries, and consecutive multi-view image frames as inputs.
  • Figure 2: Illustration of the two-modal decoder layer. The decoder, the core component of the pipeline, follows the Transformer architecture and performs three key functions: image feature sampling, generation of query features through the attention mechanism, and updating of Gaussian attributes.
  • Figure 3: Visual comparison across four scenarios.