An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images

Kanglin Ning, Ruzhao Chen, Penghong Wang, Xingtao Wang, Ruiqin Xiong, Xiaopeng Fan

TL;DR

This work tackles indoor panoramic depth estimation by incorporating room geometry priors through a multi-task network, RGCNet, which jointly predicts depth, room layout, and background segmentation. A room-geometry-based background depth resolving strategy and a segmentation-guided fusion mechanism refine depth by leveraging layout geometry and scene masks, all within an end-to-end framework built on a PanoFormer backbone. Extensive experiments on Stanford2D3D, Matterport3D, and Structured3D demonstrate significant improvements in RMSE and depth consistency, aided by a dataset denoising strategy that stabilizes training on noisy real-world data. The approach advances robust 3D understanding from $360^{\circ}$ indoor panoramas and offers practical impact for AR/VR, robotics, and indoor scene reconstruction.

Abstract

Predicting spherical pixel depth from monocular $360^{\circ}$ indoor panoramas is critical for many vision applications. However, existing methods focus on pixel-level accuracy, which leads to oversmoothed room corners and sensitivity to noise. In this paper, we propose a depth estimation framework based on room geometry constraints, which extracts room geometry information through layout prediction and integrates this information into the depth estimation process through a background segmentation mechanism. At the model level, our framework comprises a shared feature encoder followed by task-specific decoders for layout estimation, depth estimation, and background segmentation. The shared encoder extracts multi-scale features, which are subsequently processed by the individual decoders to generate initial predictions: a depth map, a room layout map, and a background segmentation map. Furthermore, our framework incorporates two strategies: a room-geometry-based background depth resolving strategy and a background-segmentation-guided fusion mechanism. The room-geometry-based background depth resolving strategy leverages the room layout and the depth decoder's output to generate the corresponding background depth map. Then, the background-segmentation-guided fusion strategy derives fusion weights for the background and coarse depth maps from the segmentation decoder's predictions. Extensive experimental results on the Stanford2D3D, Matterport3D, and Structured3D datasets show that our proposed method significantly outperforms current open-source methods. Our code is available at https://github.com/emiyaning/RGCNet.
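The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract only states that fusion weights are derived from the segmentation decoder's predictions; the specific form below (a sigmoid over hypothetical background logits, followed by a per-pixel convex combination) is an assumption for illustration and is not taken from the paper or its repository. The function name `fuse_depth` and its parameters are likewise hypothetical.

```python
import numpy as np


def fuse_depth(coarse_depth, background_depth, background_logits):
    """Blend two depth maps per pixel using a soft background mask.

    All three inputs are arrays of the same shape (H, W).
    """
    # Assumption: the fusion weight is the sigmoid of the segmentation
    # logits, i.e. the estimated probability that a pixel is background
    # (wall, floor, or ceiling).
    weight = 1.0 / (1.0 + np.exp(-background_logits))
    # Convex combination: background pixels follow the layout-derived
    # depth, foreground pixels keep the coarse depth decoder output.
    return weight * background_depth + (1.0 - weight) * coarse_depth


# Toy usage: equal confidence (logit 0) averages the two maps.
coarse = np.full((2, 3), 4.0)
background = np.full((2, 3), 2.0)
fused = fuse_depth(coarse, background, np.zeros((2, 3)))
```

Under this assumed form, a pixel that the segmentation branch confidently marks as background takes its depth almost entirely from the layout-derived map, which is one plausible way sharp room corners could survive into the final prediction.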

Paper Structure

This paper contains 23 sections, 9 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: 3D visualization of panorama depth estimators' predictions on the Stanford2D3D dataset. The left column shows the ground truth, the middle column shows PanoFormer's prediction, and the right column shows our framework's prediction.
  • Figure 2: Structure of our proposed room-geometry-guided depth estimation framework. The framework includes a shared panorama encoder and three task-specific decoders. From the resulting layout map, coarse depth map, and background segmentation map, the framework decodes a fine-grained depth prediction.
  • Figure 3: Structure of the layout feature aggregation module and the pixel feature aggregation module.
  • Figure 4: P is the camera center, A and B are the upper and lower boundary points of the wall corresponding to point P, and D is an arbitrary point on the wall plane. The length of AB in the image is known, and the corresponding angles $\phi_{c}$ and $\phi_{f}$ can be calculated from spherical camera geometry. Based on the depth of point P predicted by the depth decoder, $d_c$ and $d_f$ can be calculated.
  • Figure 5: 3D visualization results of the final depth estimation. For each scene, we show three perspectives (top view, side view, and interior view) of the point cloud converted from the depth map.
  • ...and 4 more figures
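
The point-cloud visualizations mentioned in the Figure 5 caption rely on converting a panoramic depth map into 3D points. A minimal sketch of that conversion is given below, assuming an equirectangular panorama in which each pixel stores the Euclidean distance from the camera center. The function name `panorama_depth_to_points` and the axis convention (y up, z forward at the image center) are assumptions for illustration; the paper's own code may use a different convention.

```python
import numpy as np


def panorama_depth_to_points(depth):
    """Convert an equirectangular depth map (H, W) to an (H*W, 3) point cloud."""
    height, width = depth.shape
    # Longitude spans [-pi, pi) across columns, sampled at pixel centers.
    lon = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi
    # Latitude runs from +pi/2 (top row) to -pi/2 (bottom row).
    lat = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both have shape (H, W)
    # Spherical to Cartesian; assumed convention: y up, z forward.
    x = depth * np.cos(lat) * np.sin(lon)
    y = depth * np.sin(lat)
    z = depth * np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)


# Toy usage: a constant depth of 2 maps every pixel onto a sphere of radius 2.
points = panorama_depth_to_points(np.full((4, 8), 2.0))
```

The same row-to-latitude mapping is the kind of spherical camera geometry from which boundary angles such as $\phi_{c}$ and $\phi_{f}$ in Figure 4 could be read off image rows, although the exact formulation used in the paper is not shown here.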