
Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement

Yulin He, Wei Chen, Tianci Xun, Yusong Tan

TL;DR

The paper tackles real-time 3D occupancy prediction for autonomous driving by addressing the coupling between geometry and semantics. It introduces a Geometric-Semantic Disentangled Occupancy predictor (GSD-Occ) with a Geometric-Semantic Dual-Branch Network (GSDBN) that uses a hybrid BEV-Voxel representation, and a Geometric-Semantic Decoupled Learning (GSDL) strategy to decouple geometry refinement from semantic learning. The method achieves a new real-time state-of-the-art on the Occ3D-nuScenes benchmark, with $39.4$ mIoU at $20.0$ FPS, about $3\times$ faster and $+1.9$ mIoU better than the previous winner, while maintaining lower memory. Key components include BEV-level temporal fusion, a large-kernel re-parameterized 3D convolution in the voxel branch, and a BEV-Voxel lifting module for feature fusion. The approach offers practical impact for autonomous driving by providing accurate, efficient, and robust 3D occupancy perception.
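As a rough sanity check on these figures, the stated ratios can be unpacked as follows. The baseline values are only implied by the "about $3\times$" and "$+1.9$" claims above, so they are approximate and not quoted from the paper:

$$
t_{\text{frame}} = \frac{1}{20.0\ \text{FPS}} = 50\ \text{ms}, \qquad
\text{FPS}_{\text{prev}} \approx \frac{20.0}{3} \approx 6.7, \qquad
\text{mIoU}_{\text{prev}} = 39.4 - 1.9 = 37.5 .
$$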

Abstract

Occupancy prediction plays a pivotal role in autonomous driving (AD) due to its fine-grained geometric perception and general object recognition capabilities. However, existing methods often incur high computational costs, which conflicts with the real-time demands of AD. To this end, we first evaluate the speed and memory usage of most publicly available methods, aiming to redirect the focus from solely prioritizing accuracy to also considering efficiency. We then identify a core challenge in achieving both fast and accurate performance: the strong coupling between geometry and semantics. To address this issue, 1) we propose a Geometric-Semantic Dual-Branch Network (GSDBN) with a hybrid BEV-Voxel representation. In the BEV branch, a BEV-level temporal fusion module and a U-Net encoder are introduced to extract dense semantic features. In the voxel branch, a large-kernel re-parameterized 3D convolution is proposed to refine sparse 3D geometry and reduce computation. Moreover, we propose a novel BEV-Voxel lifting module that projects BEV features into voxel space to fuse the features of the two branches. In addition to the network design, 2) we also propose a Geometric-Semantic Decoupled Learning (GSDL) strategy. This strategy initially learns semantics with accurate geometry using ground-truth depth, and then gradually mixes in predicted depth to adapt the model to the predicted geometry. Extensive experiments on the widely used Occ3D-nuScenes benchmark demonstrate the superiority of our method, which achieves 39.4 mIoU at 20.0 FPS. This result is $\sim 3 \times$ faster and +1.9 mIoU higher than FB-OCC, the winner of the CVPR 2023 3D Occupancy Prediction Challenge. Our code will be made open-source.
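The depth-mixing idea behind GSDL can be illustrated with a small sketch. This is a minimal illustration, not the authors' implementation: the linear schedule, the function names `mix_ratio` and `mixed_depth`, and the toy four-bin depth distribution are all assumptions made for this example. The abstract only states that training starts from ground-truth depth and gradually mixes in predicted depth.

```python
import numpy as np


def mix_ratio(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: share of predicted depth, 0 at the start, 1 at the end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def mixed_depth(gt_depth: np.ndarray, pred_depth: np.ndarray,
                step: int, total_steps: int) -> np.ndarray:
    """Blend a ground-truth depth distribution with a predicted one.

    Both inputs are per-pixel distributions over depth bins, so their
    convex combination is again a valid distribution.
    """
    lam = mix_ratio(step, total_steps)
    return (1.0 - lam) * gt_depth + lam * pred_depth


# Toy example: one pixel with 4 depth bins (values are made up).
gt = np.array([0.0, 0.0, 1.0, 0.0])    # one-hot ground-truth depth
pred = np.array([0.1, 0.5, 0.3, 0.1])  # imperfect predicted depth

for step in (0, 50, 100):
    print(step, mixed_depth(gt, pred, step, total_steps=100))
```

Early in training the view transformation sees exact geometry, so the downstream network can concentrate on semantics. As the ratio grows, the network is exposed to the prediction errors it must tolerate at inference time, when no ground-truth depth is available.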

Paper Structure

This paper contains 18 sections, 5 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: The inference speed (FPS) and accuracy (mIoU) of occupancy prediction methods on the Occ3D-nuScenes benchmark. GSD-Occ has a clear accuracy advantage among real-time methods.
  • Figure 2: Illustration of the geometric-semantic coupling problem. (a) Incorrect predicted depth can result in inaccurate 2D-to-3D feature projection, which the subsequent network must then refine and correct. (b) The performance gap between using predicted depth and ground-truth depth, which further underscores the importance of addressing this issue.
  • Figure 3: The overview of GSD-Occ. Multi-camera images are first fed into an image backbone network to get image features, and DepthNet (BEVDepth) is used to predict a depth distribution. The Lift-Splat-Shoot (LSS) module is then employed to explicitly transform 2D image features into 3D voxel features. Subsequently, the geometric-semantic dual-branch network exploits a hybrid BEV-Voxel representation to efficiently maintain geometric structure while extracting rich semantics. The geometric-semantic decoupled learning strategy injects ground-truth depth into LSS to separate the learning of geometric correction and semantic knowledge, thereby further improving accuracy.
  • Figure 4: Illustration of the large-kernel 3D convolutional re-parameterization technique in 3D geometric encoder. It uses parallel dilated small-kernel 3D convolutions to enhance a non-dilated large-kernel 3D convolution. This example shows $[K_H,K_W,K_Z]=[11,11,1]$.
  • Figure 5: Qualitative comparison between FB-OCC and our method. The results demonstrate that our method constructs more detailed geometry (Row 1 and Row 2), predicts more accurate semantics (Row 3), and adapts better at night (Row 4).
  • ...and 1 more figures
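The re-parameterization described in the Figure 4 caption can be sketched in NumPy. Because the example kernel there has $K_Z = 1$, a single 2D slice is enough to show the idea. This is a simplified sketch under stated assumptions rather than the paper's code: the branch sizes and dilations are hypothetical, any normalization layers that would also be folded in practice are omitted, and `conv2d_same`, `dilated_to_dense`, and `merge_branches` are helper names invented for this example.

```python
import numpy as np


def conv2d_same(x: np.ndarray, k: np.ndarray, dilation: int = 1) -> np.ndarray:
    """Naive zero-padded 'same' cross-correlation with an odd-sized kernel."""
    kh, kw = k.shape
    ph, pw = dilation * (kh // 2), dilation * (kw // 2)
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            # Each kernel tap reads a shifted window of the padded input.
            out += k[i, j] * xp[i * dilation:i * dilation + x.shape[0],
                                j * dilation:j * dilation + x.shape[1]]
    return out


def dilated_to_dense(k: np.ndarray, dilation: int) -> np.ndarray:
    """Insert zeros so a dilated small kernel becomes an equivalent non-dilated one."""
    kh, kw = k.shape
    dense = np.zeros(((kh - 1) * dilation + 1, (kw - 1) * dilation + 1))
    dense[::dilation, ::dilation] = k
    return dense


def merge_branches(large: np.ndarray, branches) -> np.ndarray:
    """Fold (kernel, dilation) branches into the large kernel by centered addition."""
    merged = large.astype(float).copy()
    for k, d in branches:
        dense = dilated_to_dense(k, d)
        oh = (merged.shape[0] - dense.shape[0]) // 2
        ow = (merged.shape[1] - dense.shape[1]) // 2
        if oh < 0 or ow < 0:
            raise ValueError("branch does not fit inside the large kernel")
        merged[oh:oh + dense.shape[0], ow:ow + dense.shape[1]] += dense
    return merged


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))      # one BEV-plane slice of a feature map
large = rng.standard_normal((11, 11))  # non-dilated large kernel
branches = [                           # hypothetical (kernel, dilation) pairs
    (rng.standard_normal((5, 5)), 1),
    (rng.standard_normal((3, 3)), 3),
    (rng.standard_normal((3, 3)), 5),
]

# Training-time form: large kernel plus parallel dilated small kernels.
multi = conv2d_same(x, large) + sum(conv2d_same(x, k, d) for k, d in branches)
# Inference-time form: one merged large kernel.
single = conv2d_same(x, merge_branches(large, branches))
print(np.allclose(multi, single))
```

The equivalence holds because convolution is linear in the kernel, and a dilated kernel is the same operator as a larger kernel with zeros between its taps. The parallel branches therefore add no cost at inference time once merged.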