Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving

Xubo Zhu; Haoyang Zhang; Fei He; Rui Wu; Yanhu Shan; Wen Yang; Huai Yu

TL;DR

Dr.Occ combines two components. A depth-guided 2D-to-3D View Transformer leverages high-quality dense depth cues from MoGe-2 to build reliable geometric priors, enabling precise geometric alignment of voxel features. A region-guided Expert Transformer adaptively allocates region-specific experts to different spatial regions, addressing spatial semantic variations.

Abstract

3D semantic occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with two problems: geometric misalignment in view transformation, caused by the lack of pixel-level accurate depth estimation, and severe spatial class imbalance, where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D$^2$-VFormer) that leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R$^2$-EFormer) that adaptively allocates region-specific experts to focus on different spatial regions, addressing spatial semantic variations. The two components are complementary: depth guidance ensures geometric alignment, while region experts enhance semantic learning. Experiments on the Occ3D-nuScenes benchmark demonstrate that Dr.Occ improves the strong baseline BEVDet4D by 7.43% mIoU and 3.09% IoU under the full vision-only setting.
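The geometric step that depth guidance relies on can be illustrated with a minimal sketch. This is not the authors' D$^2$-VFormer; it only shows the generic pinhole back-projection of a pixel with known depth, followed by voxel indexing. The function names, the intrinsics in the usage note, and the grid constants (an 80 m x 80 m x 6.4 m volume with 0.4 m voxels) are assumptions made for illustration, and the camera-to-ego transform is omitted.

```python
import math

# Assumed grid for illustration only: x, y in [-40, 40) m, z in [-1, 5.4) m,
# 0.4 m voxels, giving a 200 x 200 x 16 volume.
GRID_MIN = (-40.0, -40.0, -1.0)
VOXEL_SIZE = 0.4
GRID_SHAPE = (200, 200, 16)


def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame
    using the standard pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def point_to_voxel(point, grid_min=GRID_MIN, voxel_size=VOXEL_SIZE,
                   grid_shape=GRID_SHAPE):
    """Map a 3D point (assumed to be already in the grid's frame) to an
    integer voxel index, or return None if it falls outside the grid."""
    index = tuple(math.floor((p - lo) / voxel_size)
                  for p, lo in zip(point, grid_min))
    if all(0 <= i < n for i, n in zip(index, grid_shape)):
        return index
    return None
```

With hypothetical intrinsics fx = fy = 500 and principal point (320, 240), a pixel at the principal point with depth 10 m lifts to a point on the optical axis. The more accurate the per-pixel depth, the closer each lifted feature lands to its true voxel, which is the alignment issue the abstract describes.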

Paper Structure

This paper contains 28 sections, 10 equations, 6 figures, 2 tables.

Introduction
Related Work
2D-to-3D View Transformation
Vision-based 3D Occupancy Prediction
Mixture of Experts and Mixture of Recursions
Methods
Problem Formulation
Overall Architecture
Image Encoder
Depth-guided Geometric Enhancement
Incorporating Depth Cues
Depth-guided Dual Projection View Transformer
Region-guided Semantic Enhancement
Anisotropic Spatial Semantics
Region-guided Expert Transformer
...and 13 more sections

Figures (6)

Figure 1: In (a), we illustrate mainstream projection paradigms in vision-based 3D perception. In (b), we propose a dual-projection scheme that leverages high-quality fine-grained depth cues to enhance geometric feature representation. In (c-d), we further design an MoE/MoR-style Transformer, adaptively assigning region-specific experts to capture different spatial regions based on distance and height.
Figure 2: The overall architecture of Dr.Occ. T consecutive surround-view images are processed by MoGe-2 [wang2025moge2] to estimate depth maps, which provide geometric cues for D$^2$-VFormer to construct dense, low-cost, and geometrically accurate voxel features. These features are then refined in R$^2$-EFormer through recursive semantic decoding. Finally, the refined features are decoded by the OCC Decoder to produce the occupancy prediction.
Figure 3: Depth-guided Dual Projection View Transformer.
Figure 4: (a–b) Spatial semantic distribution reveals strong anisotropy across height and distance. (c–d) R-EFormer partitions 3D space into a 3×3 spatial grid along range (near, mid, far) and height (low, mid, high) dimensions, assigning each region a dedicated expert. (e) R$^2$-EFormer adaptively refines regions through recursive masking.
Figure 5: Benefits of depth-guided dual projection. Our D$^2$-VFormer generates more complete and detailed occupancy predictions (b) by learning geometry-aware features (a), as evidenced by the closer alignment with ground truth.
...and 1 more figures
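
The 3×3 partition described in the Figure 4 caption (near, mid, far range crossed with low, mid, high height) can be sketched as a simple lookup. This is a minimal illustration, not the authors' R-EFormer: the bin edges, the use of horizontal distance from the ego vehicle as "range", and the function names are all assumptions, and the per-region experts themselves are not modeled.

```python
import math
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bin edges in meters; the source does not state the thresholds.
RANGE_EDGES = (15.0, 30.0)   # near < 15 <= mid < 30 <= far
HEIGHT_EDGES = (0.5, 2.5)    # low < 0.5 <= mid < 2.5 <= high


def region_index(x, y, z, range_edges=RANGE_EDGES, height_edges=HEIGHT_EDGES):
    """Return a region id in 0..8 for a point in the ego frame.

    Range is taken as horizontal distance from the origin (an assumption);
    the id is range_bin * 3 + height_bin for the default 3x3 layout.
    """
    range_bin = bisect_right(range_edges, math.hypot(x, y))
    height_bin = bisect_right(height_edges, z)
    return range_bin * (len(height_edges) + 1) + height_bin


def group_by_region(points):
    """Group points by region id, so each group could be routed to its own
    region-specific expert."""
    groups = defaultdict(list)
    for point in points:
        groups[region_index(*point)].append(point)
    return dict(groups)
```

Under these assumed edges, a point 20 m ahead at 1 m height falls in the mid-range, mid-height cell (id 4). A fixed lookup like this corresponds to the static partition in panels (c–d); the recursive masking in panel (e) is not covered by the sketch.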
