OccAny: Generalized Unconstrained Urban 3D Occupancy

Anh-Quan Cao, Tuan-Hung Vu

Abstract

Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes, and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) the first generalized 3D occupancy framework, (ii) Segmentation Forcing, which improves occupancy quality while enabling mask-level prediction, and (iii) a Novel-View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on the 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny.

Paper Structure (24 sections, 5 equations, 13 figures, 13 tables)

Figures (13)

  • Figure 1: OccAny is a generalized 3D occupancy model that is trained once and can operate on out-of-domain sequential, monocular, or surround-view urban images. It produces SAM2-like features, enabling promptable segmentation.
  • Figure 2: OccAny training is done in two stages: (i) 3D Reconstruction infers the 3D scene using ${N_{rec}}$ reconstruction frames and (ii) Novel-View Rendering renders the geometry of ${N_{rnd}}$ new views with camera poses $\{\mathbf{T}_j\}_{j=1}^{N_{rnd}}$. Segmentation Forcing with SAM2 features helps regularize and improve geometry prediction. The scene memory $\mathbf{M}$ is dynamically updated during reconstruction, whereas rendering uses the final scene memory output by the reconstruction stage without updating it.
  • Figure 3: OccAny inference proceeds in two stages: (i) 3D reconstruction to retrieve ${N_{rec}}$ pointmaps with predicted camera poses $\{\mathbf{v}_i\}_{i=1}^{{N_{rec}}}$, and (ii) novel-view rendering with TTVA sampled along the trajectory of $\{\mathbf{v}_i\}_{i=1}^{{N_{rec}}}$. 3D occupancy is obtained by aggregating all pointmaps and voxelizing them with trilinear interpolation.
  • Figure 4: Occupancy predictions of OccAny and baselines on a sequence and a surround view. We visualize the predicted voxels. For qualitative analysis, we overlay the semantic ground-truth colors on the predicted voxels to better highlight class-wise gains. False-positive voxels are painted in gray without any overlaid color. Compared to the baselines, our occupancy predictions are denser and more accurate.
  • Figure 5: A qualitative ablation shows the gains from Segmentation Forcing and Novel-View Rendering. Voxel colorization follows Figure 4. The two proposed strategies significantly improve the density and the accuracy of occupancy predictions.
  • ...and 8 more figures
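
The Figure 3 caption states that occupancy is obtained by aggregating all pointmaps and voxelizing them with trilinear interpolation. The excerpt gives no further detail, so the following is only a minimal sketch of one common reading of that step: each 3D point spreads a unit weight over its 8 neighboring voxel centers. The function names (`splat_trilinear`, `occupied_voxels`), the sparse dictionary grid, the voxel size, and the occupancy threshold are all illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from itertools import chain, product
from math import floor


def splat_trilinear(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Accumulate soft occupancy weights on a sparse voxel grid.

    Each point distributes a total weight of 1 over the 8 voxel centers
    that surround it, using trilinear interpolation weights.
    Returns a dict mapping integer voxel indices (i, j, k) to weights.
    """
    grid = defaultdict(float)
    for point in points:
        # Continuous position in voxel units, measured from voxel centers.
        coords = [(p - o) / voxel_size - 0.5 for p, o in zip(point, origin)]
        base = [floor(c) for c in coords]
        frac = [c - b for c, b in zip(coords, base)]
        for offset in product((0, 1), repeat=3):
            weight = 1.0
            for f, d in zip(frac, offset):
                weight *= f if d else 1.0 - f
            if weight > 0.0:
                index = tuple(b + d for b, d in zip(base, offset))
                grid[index] += weight
    return dict(grid)


def occupied_voxels(grid, threshold=0.5):
    """Return the set of voxel indices whose accumulated weight reaches the threshold."""
    return {index for index, weight in grid.items() if weight >= threshold}


# Hypothetical usage: two tiny "pointmaps" (lists of metric XYZ points)
# are aggregated by simple concatenation before voxelization.
pointmaps = [
    [(0.5, 0.5, 0.5)],  # lies exactly on the center of voxel (0, 0, 0)
    [(1.0, 0.5, 0.5)],  # lies halfway between voxels (0, 0, 0) and (1, 0, 0)
]
all_points = list(chain.from_iterable(pointmaps))
grid = splat_trilinear(all_points, voxel_size=1.0)
occupied = occupied_voxels(grid, threshold=0.5)
```

Compared with hard nearest-voxel assignment, this soft splatting lets a point near a voxel boundary contribute to both neighbors, which tends to reduce holes when the aggregated pointmaps are sparse. Whether the method thresholds accumulated weights in this way is not stated in the excerpt.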