DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction

Xu Zhao, Pengju Zhang, Bo Liu, Yihong Wu

TL;DR

DGOcc tackles monocular 3D occupancy prediction for outdoor scenes by addressing depth ambiguity and high compute costs. It introduces a Depth-aware Global Query-based Decoder that propagates 2D depth-aware features to 3D voxels and a Hierarchical Supervision Strategy that avoids full-resolution upsampling by supervising a subset of voxels at high resolution. The approach integrates a Depth Feature Extractor with multi-scale image features, and leverages a 3D UNet for multi-scale voxel representations, enabling long-range interactions via global queries. Empirical results on SemanticKITTI and SSCBench-KITTI-360 show state-of-the-art performance with substantial reductions in GPU memory and training/inference time, highlighting practical benefits for real-time monocular scene understanding. Overall, DGOcc advances depth-aware 3D occupancy prediction by combining explicit depth context, global-scale attention, and efficient hierarchical supervision, with significant implications for autonomous driving and 3D scene reconstruction.

Abstract

Monocular 3D occupancy prediction, which aims to predict the occupancy and semantics within regions of interest in 3D scenes from only 2D images, has garnered increasing attention recently for its vital role in 3D scene understanding. Predicting the 3D occupancy of large-scale outdoor scenes from 2D images is ill-posed and resource-intensive. In this paper, we present DGOcc, a Depth-aware Global query-based network for monocular 3D Occupancy prediction. We first explore prior depth maps to extract depth context features that provide explicit geometric information for the occupancy network. Then, in order to fully exploit the depth context features, we propose a Global Query-based (GQ) Module. The cooperation of attention mechanisms and scale-aware operations facilitates the feature interaction between images and 3D voxels. Moreover, a Hierarchical Supervision Strategy (HSS) is designed to avoid upsampling the high-dimensional 3D voxel features to full resolution, which reduces GPU memory usage and time cost. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate that the proposed method achieves the best performance on monocular semantic occupancy prediction while reducing GPU and time overhead.

Paper Structure

This paper contains 23 sections, 3 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Statistical results of the 3D semantic voxel ground truth in the SemanticKITTI validation set. (a) The chart shows the percentage of voxels that do not require further subdivision at different resolutions. (b) The red voxels should be subdivided at $128\times 128\times 16$ resolution, while the blue ones need no subdivision.
  • Figure 2: Overview of DGOcc. Given input images, the Pre-trained Depth Estimator first estimates a depth map for each image. Then, the Heterogeneous Feature Encoder extracts multi-scale image features and single-scale depth context features. Together, the two feature types constitute the 2D depth-aware features. The Global Query-based Decoder propagates information from the 2D depth-aware features to the 3D voxel features with global queries. The resulting 3D voxel features are finally fed to the Hierarchical Supervision Strategy module to generate hierarchical occupancy predictions.
  • Figure 3: Illustration of the Global Query-based Decoder. The 3D voxel features and global queries are first initialized with the 2D depth-aware features in their respective modules. Then a multi-scale, global-aware paradigm exchanges information between the 3D voxel features and the global queries. After $N$ iterations, the 3D voxel features, now saturated with geometric and semantic cues, are used for occupancy prediction.
  • Figure 4: Qualitative results of Symphonies and DGOcc on the SemanticKITTI validation set. DGOcc is better at hallucinating unseen regions and thus recovers more complete scenes. Moreover, DGOcc localizes individual objects, such as cars, more accurately.
  • Figure 5: Qualitative results of Symphonies and DGOcc on the SSCBench-KITTI-360 validation set. Our method shows stronger hallucination capability and more accurate classification, and thus constructs more complete 3D scenes.
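
The statistic described in the Figure 1 caption, the share of voxels that do not require further subdivision at a given resolution, can be approximated with a short NumPy sketch. This is an illustration rather than the authors' code. It assumes that a coarse voxel "does not require subdivision" when all of the fine ground-truth voxels it covers share one label; the function names `uniform_block_mask` and `uniform_block_fraction` and the toy grid are hypothetical, and any ignore label is treated as an ordinary class.

```python
import numpy as np


def uniform_block_mask(labels: np.ndarray, factor: int) -> np.ndarray:
    """Boolean coarse grid: True where all fine voxels in a block share one label.

    Assumption (not from the paper): a coarse voxel needs no subdivision
    exactly when its factor^3 fine voxels carry the same semantic label.
    """
    x, y, z = labels.shape
    if x % factor or y % factor or z % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    # Split each axis into (coarse index, offset inside the block).
    blocks = labels.reshape(
        x // factor, factor, y // factor, factor, z // factor, factor
    )
    # Move the three in-block offsets to the end and flatten them.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5).reshape(
        x // factor, y // factor, z // factor, -1
    )
    # A block is uniform if every voxel equals the block's first voxel.
    return (blocks == blocks[..., :1]).all(axis=-1)


def uniform_block_fraction(labels: np.ndarray, factor: int) -> float:
    """Fraction of coarse voxels that would not need subdivision."""
    return float(uniform_block_mask(labels, factor).mean())


# Toy ground truth (hypothetical): a 4x4x2 grid of class 0 with one class-1 voxel.
toy = np.zeros((4, 4, 2), dtype=np.int64)
toy[0, 0, 0] = 1

mask = uniform_block_mask(toy, factor=2)          # coarse grid of shape (2, 2, 1)
fraction = uniform_block_fraction(toy, factor=2)  # 3 of 4 blocks are uniform
print(mask.shape, fraction)
```

Under the same assumption, the complement of the mask (`~mask`) marks the coarse voxels that would be worth supervising at a finer resolution, which is the intuition behind supervising only a subset of voxels at high resolution.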