ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction

Yi Feng, Yu Han, Xijing Zhang, Tanghui Li, Yanting Zhang, Rui Fan

TL;DR

ViPOcc tackles the ill-posed problem of inferring 3D occupancy from a single image by leveraging visual priors from vision foundation models (VFMs). It introduces two coupled branches: a metric depth estimation branch that uses inverse depth alignment to reconcile VFM priors with ground-truth depth, and a 3D occupancy branch that employs a Grounded-SAM–guided SNOG sampler for instance-aware, non-overlapping ray sampling. The training objective combines a temporal alignment loss and a reconstruction consistency loss to enforce cross-frame photometric and intra-frame geometric coherence. Results on KITTI-360 and KITTI Raw, together with zero-shot evaluation on DDAD, show state-of-the-art performance in both monocular depth estimation and 3D occupancy prediction, which benefits continuous scene understanding in autonomous driving.

Abstract

Inferring the 3D structure of a scene from a single image is an ill-posed and challenging problem in the field of vision-centric autonomous driving. Existing methods usually employ neural radiance fields to produce voxelized 3D occupancy, and they lack instance-level semantic reasoning and temporal photometric consistency. In this paper, we propose ViPOcc, which leverages the visual priors from vision foundation models (VFMs) for fine-grained 3D occupancy prediction. Unlike previous works that solely employ volume rendering for RGB and depth image reconstruction, we introduce a metric depth estimation branch, in which an inverse depth alignment module is proposed to bridge the domain gap in depth distribution between VFM predictions and the ground truth. The recovered metric depth is then utilized in temporal photometric alignment and spatial geometric alignment to ensure accurate and consistent 3D occupancy prediction. We also propose a semantic-guided non-overlapping Gaussian mixture (SNOG) sampler for efficient, instance-aware ray sampling, which addresses the redundant and imbalanced sampling issue that still exists in previous state-of-the-art methods. Extensive experiments demonstrate the superior performance of ViPOcc in both 3D occupancy prediction and depth estimation tasks on the KITTI-360 and KITTI Raw datasets. Our code is available at: https://mias.group/ViPOcc.
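
To make the idea of aligning a VFM depth prediction with metric depth more concrete, the following is a minimal sketch of one common approach: a closed-form least-squares fit of a scale and a shift in inverse-depth space. This is an assumption for illustration only. The abstract does not specify how the inverse depth alignment module in ViPOcc is built, and the function name `align_inverse_depth`, its parameters, and the use of a validity mask are hypothetical.

```python
import numpy as np


def align_inverse_depth(vfm_inv_depth, ref_depth, valid_mask, min_depth=1e-3):
    """Fit scale s and shift t so that s * vfm_inv_depth + t ~= 1 / ref_depth.

    Illustrative only: a plain least-squares alignment, not necessarily
    the module used in ViPOcc.

    vfm_inv_depth: (H, W) relative inverse depth from a foundation model.
    ref_depth:     (H, W) metric reference depth (may be sparse).
    valid_mask:    (H, W) boolean mask of pixels with a usable reference.
    """
    x = vfm_inv_depth[valid_mask].astype(np.float64)
    # Work in inverse depth, where the relative prediction is affine-related.
    y = 1.0 / np.clip(ref_depth[valid_mask].astype(np.float64), min_depth, None)

    # Solve min_{s,t} || s * x + t - y ||^2 in closed form.
    design = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(design, y, rcond=None)

    # Apply the fit to every pixel and convert back to metric depth.
    aligned_inv = s * vfm_inv_depth + t
    metric_depth = 1.0 / np.clip(aligned_inv, 1e-6, None)
    return metric_depth, s, t
```

Fitting in inverse depth rather than depth gives nearby pixels more weight than distant ones, which is one reason this space is often preferred for driving scenes with long depth ranges.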

Paper Structure

This paper contains 39 sections, 17 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Single-view 3D scene reconstruction results. KYN (Li et al., 2024) struggles to recover clear object boundaries (green boxes) and exhibits poor reconstruction performance for distant objects (blue circles). ViPOcc outperforms KYN in both monocular depth estimation and 3D occupancy prediction tasks.
  • Figure 2: An illustration of our proposed ViPOcc framework. Unlike previous approaches that rely solely on NeRF for 3D scene reconstruction, ViPOcc introduces an additional depth prediction branch and an instance-aware SNOG sampler for temporal photometric alignment and spatial geometric alignment.
  • Figure 3: An illustration of our proposed SNOG sampler.
  • Figure 4: Qualitative comparison of 3D occupancy prediction on the KITTI-360 dataset: (a) input RGB images; (b) BTS results; (c) KYN results; (d) our results. A darker voxel color indicates a lower altitude.
  • Figure 5: Depth distribution comparison.
  • ...and 5 more figures
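
As a rough illustration of what instance-aware, non-overlapping sampling (the role of the SNOG sampler in Figure 3) could look like, the sketch below draws patch anchors from a mixture of Gaussians centered on instance bounding boxes and rejects any patch that overlaps one already accepted. Everything here is an assumption: the box format, the choice of standard deviation, the rejection scheme, and the function name `sample_non_overlapping_patches` are hypothetical and are not taken from the paper.

```python
import numpy as np


def sample_non_overlapping_patches(boxes, image_hw, num_patches,
                                   patch=8, max_tries=1000, seed=0):
    """Sample top-left patch anchors near instances without overlap.

    Illustrative only, not the paper's sampler.

    boxes:    (K, 4) instance boxes as (x0, y0, x1, y1), assumed to come
              from a segmentation model.
    image_hw: (height, width) of the image.
    Returns an (M, 2) integer array of (left, top) anchors, M <= num_patches.
    """
    rng = np.random.default_rng(seed)
    h, w = image_hw
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    occupied = np.zeros((h, w), dtype=bool)  # pixels already covered
    anchors = []

    for _ in range(max_tries):
        if len(anchors) == num_patches:
            break
        if len(boxes) > 0:
            # Pick one mixture component (instance) uniformly at random.
            x0, y0, x1, y1 = boxes[rng.integers(len(boxes))]
            mean = np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])
            std = np.maximum(np.array([x1 - x0, y1 - y0]) / 4.0, 1.0)
            cx, cy = rng.normal(mean, std)
        else:
            # No instances detected: fall back to uniform sampling.
            cx, cy = rng.uniform(0, w), rng.uniform(0, h)

        left = int(round(cx)) - patch // 2
        top = int(round(cy)) - patch // 2
        # Reject patches that leave the image or overlap an accepted patch.
        if left < 0 or top < 0 or left + patch > w or top + patch > h:
            continue
        if occupied[top:top + patch, left:left + patch].any():
            continue
        occupied[top:top + patch, left:left + patch] = True
        anchors.append((left, top))

    return np.array(anchors, dtype=np.int64).reshape(-1, 2)
```

The function may return fewer than `num_patches` anchors if `max_tries` is exhausted, for example when the boxes are small relative to the patch size; a caller would need to handle that case.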