Rebenchmarking Unsupervised Monocular 3D Occupancy Prediction

Zizhan Guo; Yi Feng; Mengtan Zhang; Haoran Zhang; Wei Ye; Rui Fan

TL;DR

The paper tackles unsupervised monocular 3D occupancy prediction by addressing a fundamental training-evaluation mismatch: NeRF-style density outputs are not directly comparable with voxel-wise 3D occupancy ground truth, especially in occluded regions. It introduces an interpretable occupancy representation based on the opacity $\alpha$ and a coordinate-transformed occupancy sampling (CTS) scheme that aligns predictions with voxel grids, so that unsupervised methods can be evaluated against 3D ground truth rather than 2D ground truth. It also adds an occlusion-aware occupancy polarization mechanism that uses multi-view visual cues to provide explicit supervision in occluded areas, where photometric signals are weak. Experiments on KITTI-360 with 3D ground truth from SSCBench-KITTI-360 show state-of-the-art unsupervised performance, with results competitive with or superior to supervised baselines on several metrics, along with improved occlusion reasoning and generalization, including zero-shot tests on SemanticKITTI. The resulting benchmark bridges NeRF-based unsupervised learning and voxel-level 3D occupancy evaluation, with the aim of more reliable 3D scene understanding in autonomous systems.

Abstract

Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation, overlooking the inconsistency between training and evaluation protocols. Moreover, the prevalent use of 2D ground truth fails to reveal the inherent ambiguity in occluded regions caused by insufficient geometric constraints. To address these issues, this paper presents a reformulated benchmark for unsupervised monocular 3D occupancy prediction. We first interpret the variables involved in the volume rendering process and identify the most physically consistent representation of the occupancy probability. Building on these analyses, we improve existing evaluation protocols by aligning the newly identified representation with voxel-wise 3D occupancy ground truth, thereby enabling unsupervised methods to be evaluated in a manner consistent with that of supervised approaches. Additionally, to impose explicit constraints in occluded regions, we introduce an occlusion-aware polarization mechanism that incorporates multi-view visual cues to enhance discrimination between occupied and free spaces in these regions. Extensive experiments demonstrate that our approach not only significantly outperforms existing unsupervised approaches but also matches the performance of supervised ones. Our source code and evaluation protocol will be made available upon publication.
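The relationship between the network output $\sigma$ and the opacity $\alpha$ mentioned above can be sketched with the standard NeRF volume rendering identity $\alpha = 1 - \exp(-\sigma\,\delta)$, where $\delta$ is the spacing between adjacent samples on a ray. The snippet below is a minimal illustration only: the function name `opacity_from_density` and the numeric values are assumptions made for this example, not taken from the paper or its code.

```python
import numpy as np


def opacity_from_density(sigma, delta):
    """Standard NeRF opacity: alpha = 1 - exp(-sigma * delta).

    sigma: non-negative densities at the sampled points (hypothetical values).
    delta: spacing between adjacent samples along the ray, same shape as sigma.
    """
    sigma = np.asarray(sigma, dtype=np.float64)
    delta = np.asarray(delta, dtype=np.float64)
    return 1.0 - np.exp(-sigma * delta)


# Illustrative (made-up) case: two points with very different densities,
# sampled with very different spacings, but the same product sigma * delta.
sigma = np.array([50.0, 2.5])   # unbounded above, differs by a factor of 20
delta = np.array([0.1, 2.0])    # non-uniform sampling intervals
alpha = opacity_from_density(sigma, delta)
print(alpha)  # both entries equal 1 - exp(-5), roughly 0.993
```

In this toy case the two densities differ by a factor of 20 while the two opacities coincide, and every opacity stays below 1 however large $\sigma$ becomes. This is the kind of behavior that makes $\alpha$ easier to read as a probability than $\sigma$.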

Paper Structure (30 sections, 22 equations, 7 figures, 7 tables)

Introduction
Related Work
Supervised 3D Occupancy Prediction
Unsupervised 3D Occupancy Prediction
Benchmarks for 3D Occupancy Prediction
Methodology
Problem Setup
Occupancy Probability Interpretation for NeRF
Coordinate-Transformed Occupancy Sampling
Occlusion-Aware Occupancy Polarization
Experiments
Benchmark
Dataset Temporal Alignment
Transformation between Coordinate Systems
Mask Generation
...and 15 more sections

Figures (7)

Figure 1: A comparison between the network output $\sigma$ and the opacity $\alpha$ during inference: (a) two representative sampled rays; (b) $\sigma$ distributions; (c) $\alpha$ distributions. For point A, which transitions from occupied to free space, $\alpha_A$ is bounded within the range $(0, 1)$, whereas $\sigma_A$ has no upper bound, making $\alpha$ a more suitable representation of occupancy probability. For points B and C, which have identical occupancy status, the discrepancy in $\sigma$ is significantly greater than that in $\alpha$, showing that the proposed representation effectively eliminates the magnitude variation caused by non-uniform point sampling.
Figure 2: The occupancy sampling algorithm in the camera coordinate system (CCS) and the transformed coordinate system (TCS): (a) network inference with sampled points as input; (b) opacity distribution vs. the voxel grid in the CCS; (c) opacity sampling using voxel centers in the TCS.
Figure 3: An illustration of the occlusion-aware occupancy polarization mechanism. For adjacent occluded points in the target view, the discrepancy in their sampled colors from the source view indicates that the colors likely originate from distinct objects. The proposed mechanism amplifies the occupancy differences between such points, enabling the network to refine predictions in occluded regions.
Figure 4: An illustration of the coordinate system transformation, shown in the bird’s-eye view of the occupancy ground truth. It depicts the transformation from the voxel coordinate system associated with the $j$-th frame to the camera coordinate system of the $i$-th frame, which enables the subsequent computation of the frustum mask and the visibility mask.
Figure 5: Qualitative comparisons of 3D occupancy prediction on the KITTI-360 dataset: (a) input RGB images; (b) BTS results; (c) ViPOcc results; (d) our results.
...and 2 more figures
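
The transformation described in the Figure 4 caption (voxel coordinates of one frame mapped into the camera coordinate system of another, followed by a frustum mask) can be sketched generically as below. This is a hedged illustration with a pinhole camera model, not the paper's implementation: the helper names `voxel_centers` and `frustum_mask`, the rigid transform `T_cam_from_vox`, the intrinsics `K`, and the axis conventions are all assumptions, and the visibility mask is not covered.

```python
import numpy as np


def voxel_centers(origin, voxel_size, dims):
    """Return the (N, 3) centers of a regular voxel grid (hypothetical layout)."""
    axes = [np.arange(d) for d in dims]
    idx = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    return np.asarray(origin, dtype=np.float64) + (idx + 0.5) * voxel_size


def frustum_mask(points_vox, T_cam_from_vox, K, image_hw):
    """Mark points that project inside the image of a pinhole camera.

    points_vox:     (N, 3) points in the voxel (grid) coordinate system.
    T_cam_from_vox: (4, 4) rigid transform to the camera frame (assumed given).
    K:              (3, 3) camera intrinsics; the camera looks along +z.
    image_hw:       (height, width) in pixels.
    """
    ones = np.ones((points_vox.shape[0], 1))
    homo = np.concatenate([points_vox, ones], axis=1)   # (N, 4)
    cam = (T_cam_from_vox @ homo.T).T[:, :3]            # points in camera frame
    z = cam[:, 2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)                 # avoid division by zero
    uv = (K @ cam.T).T
    u = uv[:, 0] / z_safe
    v = uv[:, 1] / z_safe
    h, w = image_hw
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return in_front & inside, cam


# Toy usage with an identity transform and made-up intrinsics.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 1.0],    # in front of the camera, projects to (50, 50)
                [0.0, 0.0, -1.0],   # behind the camera
                [10.0, 0.0, 1.0]])  # in front, but far outside the image
mask, _ = frustum_mask(pts, np.eye(4), K, (100, 100))
print(mask)
```

Only the first toy point lies inside the frustum. In an evaluation setting, a mask of this kind would restrict which ground-truth voxels are compared against the prediction.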
