Monocular Occupancy Prediction for Scalable Indoor Scenes

Hongxiao Yu, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang

TL;DR

This work addresses the challenge of monocular 3D occupancy prediction in indoor scenes, where scene-scale variation and object density complicate depth estimation. The proposed ISO framework combines a Depth Branch with a Dual Feature Line of Sight Projection (D-FLoSP) and a multi-scale feature fusion strategy to lift 2D image cues into 3D voxel features, enabling accurate occupancy and semantics from a single image. To support scalable indoor research, the authors introduce Occ-ScanNet, a large-scale indoor occupancy benchmark derived from ScanNet, significantly enlarging the data pool over NYUv2. Across NYUv2 and Occ-ScanNet, ISO achieves state-of-the-art results, demonstrating strong generalization and scalability potential for indoor occupancy tasks, with public release of code and dataset. Overall, this work advances monocular indoor scene understanding by integrating depth-guided 2D-to-3D feature transformation and multi-scale depth fusion in a voxel-based occupancy framework.

Abstract

Camera-based 3D occupancy prediction has recently garnered increasing attention in outdoor driving scenes. However, research in indoor scenes remains relatively unexplored. The core differences in indoor scenes lie in the complexity of scene scale and the variance in object size. In this paper, we propose a novel method, named ISO, for predicting indoor scene occupancy using monocular images. ISO harnesses the advantages of a pretrained depth model to achieve accurate depth predictions. Furthermore, we introduce the Dual Feature Line of Sight Projection (D-FLoSP) module within ISO, which enhances the learning of 3D voxel features. To foster further research in this domain, we introduce Occ-ScanNet, a large-scale occupancy benchmark for indoor scenes. With a dataset size 40 times larger than the NYUv2 dataset, it facilitates future scalable research in indoor scene analysis. Experimental results on both NYUv2 and Occ-ScanNet demonstrate that our method achieves state-of-the-art performance. The dataset and code are made publicly available at https://github.com/hongxiaoy/ISO.git.
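
To make the depth-guided 2D-to-3D transformation concrete, the following is a minimal sketch of lifting a single voxel centre, loosely following the Figure 1 caption (multiply the voxel's depth weight with the sampled features, then sum). It is not the authors' implementation: the pinhole projection, the uniform depth bins, the nested-list feature layout, the summation over feature scales, and every function and parameter name here are assumptions made for illustration.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to (u, v, depth)."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy, z


def depth_bin(z, d_min, d_max, n_bins):
    """Index of the uniform depth bin containing z, or None if out of range."""
    if not (d_min <= z < d_max):
        return None
    return int((z - d_min) / (d_max - d_min) * n_bins)


def lift_voxel(point, scales, intrinsics, d_min, d_max):
    """Depth-weighted feature for one voxel centre, summed over scales.

    `scales` is a list of (stride, feat, depth_prob) tuples, where
    feat[row][col] is a list of C floats and depth_prob[row][col] is a
    list of per-bin probabilities (hypothetical layout, an assumption).
    Returns None if the voxel is behind the camera or never visible.
    """
    fx, fy, cx, cy = intrinsics
    if point[2] <= 0:  # behind the camera: nothing to sample
        return None
    u, v, z = project(point, fx, fy, cx, cy)
    out = None
    for stride, feat, depth_prob in scales:
        col, row = int(u // stride), int(v // stride)
        if not (0 <= row < len(feat) and 0 <= col < len(feat[0])):
            continue  # projects outside this feature map
        b = depth_bin(z, d_min, d_max, len(depth_prob[row][col]))
        if b is None:
            continue  # depth outside the modelled range
        w = depth_prob[row][col][b]  # probability that the surface lies at this depth
        weighted = [w * f for f in feat[row][col]]  # element-wise multiplication
        out = weighted if out is None else [a + c for a, c in zip(out, weighted)]
    return out
```

Weighting by the depth probability means a voxel only receives a strong feature when the predicted depth along its line of sight agrees with the voxel's own distance from the camera, rather than every voxel on the ray receiving the same feature.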

Paper Structure

This paper contains 32 sections, 8 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: The core design of ISO centers on the transformation of features from 2D to 3D space, comprising the Depth Branch and the D-FLoSP module. The depth branch first uses a pre-trained depth model to estimate a pixel-wise depth map, which DepthNet then processes to generate the final depth distribution. An element-wise multiplication between the voxel depth and the features, followed by summation, is then performed to derive the initial 3D voxel feature. The 3D feature is further processed to predict the 3D scene occupancy.
  • Figure 2: Comparison of NYUv2 and Occ-ScanNet Benchmark. In (a), the depth ranges of NYUv2 and Occ-ScanNet are distinguished by dark and light green, respectively, with the horizontal axis indicating the minimum depth and the vertical axis showing the maximum depth of scenes. (b) quantitatively demonstrates that Occ-ScanNet possesses a significantly larger data scale compared to the NYUv2 dataset.
  • Figure 3: Sample visualizations from the Occ-ScanNet benchmark. The original RGB images are shown in columns 1, 3, and 5, and the corresponding scene voxel labels in columns 2, 4, and 6. The first two rows show different views from different scenes; the last two rows each show three different views from the same scene.
  • Figure 4: Pipeline of Occ-ScanNet dataset label generation. Color images, depth images, camera intrinsics, and poses are extracted from ScanNet scenes. For each scene, 100 frames were sampled and randomly split into training and validation sets with a 7/3 ratio. Frames with invalid camera poses or exceeding scene boundaries were filtered out. Only the area in front of the camera was analyzed, necessitating careful selection of the voxel origin. Voxels were labeled based on their nearest voxel in the CompleteScanNet dataset. Frames with >95% unknown/empty labels or <2 semantic classes were excluded, resulting in generated 3D voxel labels for each frame.
  • Figure 5: Qualitative analysis on the Occ-ScanNet dataset. The input image is displayed on the left, the predicted scenes are shown in the middle two columns, and the ground truth is shown in the right column.
  • ...and 1 more figure
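
The frame split and the final exclusion rule in the Figure 4 caption can be sketched as below. This is only an illustration under stated assumptions: the label ids for "empty" and "unknown", the flat list-of-labels representation, the fixed seed, and the function names are hypothetical and do not come from the released code. Only the 7/3 ratio and the two thresholds (more than 95% unknown/empty, fewer than 2 semantic classes) are taken from the caption.

```python
import random

EMPTY, UNKNOWN = 0, 255  # hypothetical label ids, not taken from the paper


def keep_frame(labels, max_void_ratio=0.95, min_classes=2):
    """Return True if a frame's flat list of voxel labels passes both checks."""
    if not labels:
        return False
    void = sum(1 for label in labels if label in (EMPTY, UNKNOWN))
    classes = {label for label in labels if label not in (EMPTY, UNKNOWN)}
    # Exclude frames with >95% unknown/empty voxels or <2 semantic classes.
    return void / len(labels) <= max_void_ratio and len(classes) >= min_classes


def split_frames(frame_ids, train_ratio=0.7, seed=0):
    """Shuffle frame ids and split them into (train, val) lists."""
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)  # seeded only to make the sketch repeatable
    cut = round(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]
```

For example, `split_frames(range(100))` yields 70 training and 30 validation frame ids, and `keep_frame` would then be applied to the generated voxel labels of each frame.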