GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision

Xin Tan, Wenbin Wu, Zhiwei Zhang, Chaojie Fan, Yong Peng, Zhizhong Zhang, Yuan Xie, Lizhuang Ma

TL;DR

GEOcc tackles vision-only 3D occupancy by addressing depth modeling and LiDAR-sparsity generalization via a hybrid depth framework that fuses explicit lift-based depth with implicit projection-based depth, integrated through a light 3D convolution-based compressor and a mask-based Transformer decoder. It introduces context-aware self-training (CAST) to render depth maps from occupancy features and supervise them with image reconstruction losses, improving geometric priors without extra 3D labels. The approach achieves state-of-the-art mIoU on Occ3D-nuScenes with lightweight backbones (e.g., 43.64% with ResNet-50 and 44.67% with Swin-B) and demonstrates robust gains in ablations, though volume rendering remains a computational bottleneck to optimize for real-time deployment. The work advances vision-based surround-view perception by coupling hybrid depth fusion, semantic-aware decoding, and self-supervised geometric priors for denser, more generalizable occupancy representations.

Abstract

3D occupancy perception plays a pivotal role in recent vision-centric autonomous driving systems by converting surround-view images into integrated geometric and semantic representations within dense 3D grids. Nevertheless, current models still face two main challenges: modeling depth accurately in the 2D-3D view transformation stage, and limited generalizability caused by sparse LiDAR supervision. To address these issues, this paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach is three-fold: 1) integration of explicit lift-based depth prediction and implicit projection-based transformers for depth modeling, enhancing the density and robustness of view transformation; 2) utilization of a mask-based encoder-decoder architecture for fine-grained semantic predictions; 3) adoption of context-aware self-training loss functions in the pretraining stage to complement LiDAR supervision, which re-render 2D depth maps from 3D occupancy features and leverage an image reconstruction loss to obtain denser depth supervision in addition to sparse LiDAR ground truth. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest image resolution and the lightest image backbone among current models, marking an improvement of 3.3% due to our proposed contributions. Comprehensive experiments also demonstrate the consistent superiority of our method over baselines and alternative approaches.
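The re-rendering of 2D depth maps from 3D occupancy features can be illustrated with a small sketch. This excerpt does not give the paper's exact rendering equations, so the code below assumes the standard volume-rendering quadrature (per-sample opacity from density, accumulated transmittance, expected depth as a weighted sum). The function name `render_depth` and its arguments are hypothetical, and the sketch handles a single ray only.

```python
import math


def render_depth(densities, depths):
    """Expected depth along one ray (assumed standard volume-rendering quadrature).

    densities: non-negative density at each sample along the ray
    depths:    increasing sample depths along the ray (same length)
    Returns (expected_depth, accumulated_weight).
    """
    assert len(densities) == len(depths)
    transmittance = 1.0   # probability the ray has not been stopped yet
    expected_depth = 0.0
    weight_sum = 0.0
    for i, (sigma, t) in enumerate(zip(densities, depths)):
        # Spacing to the next sample; the last sample reuses the previous spacing.
        if i + 1 < len(depths):
            delta = depths[i + 1] - t
        elif i > 0:
            delta = t - depths[i - 1]
        else:
            delta = 1.0
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # chance the ray stops here
        expected_depth += weight * t
        weight_sum += weight
        transmittance *= 1.0 - alpha
    return expected_depth, weight_sum


# A ray that passes through free space and hits a dense voxel at depth 3.
depth, hit = render_depth([0.0, 0.0, 1e6, 0.0], [1.0, 2.0, 3.0, 4.0])
```

In a training setup of the kind the abstract describes, a depth rendered this way for every pixel could be compared against sparse LiDAR depth where available and used to warp neighboring images for a reconstruction loss elsewhere. How GEOcc combines these terms is not specified in this excerpt.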

Paper Structure

This paper contains 14 sections, 11 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Performance comparison with other state-of-the-art methods. Different colors represent various image backbones, from the smallest ResNet-50 to the largest InternImage-B in the experiment settings of existing approaches. Our GEOcc network achieves leading performance while using lighter image backbones (ResNet-50 and SwinTransformer-B) and lower image resolutions.
  • Figure 2: The GEOcc network architecture processes surround-view images from multiple timestamps, using an image backbone to extract multi-scale image features $\mathbf{F}$ and depth distribution features $\mathbf{D}$. The Implicit Depth Modeling (IDM) performs self-attention and projection-based deformable cross-attention to generate implicit occupancy features $\mathbf{O}_\textbf{I}$, while the Explicit Depth Modeling (EDM) performs a cross-product between the lifted depth prediction and image features to create grid-like pseudo-LiDAR points, which are then warped into explicit occupancy features $\mathbf{O}_\textbf{E}$. The occupancy features are subsequently fused and fed into a transformer encoder and a Masked Decoder Head at different resolutions to yield the final predictions. During geometric pretraining, an extra MLP layer predicts geometric density, which is then rendered into surround-view perspective depth maps. Finally, geometric pretraining is accomplished by applying three types of Context-Aware Self-Training (CAST) losses. Best viewed zoomed in.
  • Figure 3: Per-class mIoU comparison with the previous SOTA methods. Our approach achieves better performance in most classes. All class-wise comparisons use an image resolution of 256x704 and a ResNet-50 backbone.
  • Figure 4: Qualitative visualization results for occupancy prediction in surround camera view. Best viewed zoomed in.
  • Figure 5: Visualization analysis of model performance under low-quality imaging conditions. The left shows a rainy scene. The right illustrates motion blur caused by road bumps.
  • ...and 3 more figures
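
The explicit lifting step described in the Figure 2 caption can be sketched for a single pixel. The caption's "cross-product" is read here as the per-pixel outer product between a depth distribution and an image feature vector, as used by lift-based view transformation in general; that reading is an assumption, as are the names `softmax` and `lift_pixel`. The splatting of the resulting pseudo-LiDAR points into a voxel grid is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(depth_logits, feature):
    """Lift one pixel's feature vector along its camera ray.

    depth_logits: D raw scores, one per discrete depth bin
    feature:      C-dimensional image feature for the pixel
    Returns a D x C list: the feature scaled by the probability of each depth bin.
    """
    probs = softmax(depth_logits)
    return [[p * f for f in feature] for p in probs]


# Two equally likely depth bins split the feature evenly between them.
lifted = lift_pixel([0.0, 0.0], [2.0, 4.0])
```

Because the depth probabilities sum to one, summing the lifted features over the depth bins recovers the original feature vector. A confident depth prediction therefore concentrates the feature at one 3D location, and an uncertain one spreads it along the ray.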