Multi-task Geometric Estimation of Depth and Surface Normal from Monocular 360° Images

Kun Huang, Fang-Lue Zhang, Fangfang Zhang, Yu-Kun Lai, Paul L. Rosin, Neil A. Dodgson

TL;DR

This work tackles 360° monocular geometry by jointly estimating depth and surface normals from equirectangular projection (ERP) panoramas. It introduces a distortion-aware multi-task transformer framework with a fusion module that enables soft parameter sharing between the depth and normal branches, plus a multi-scale spherical decoder that captures both fine and global geometry. The approach achieves state-of-the-art results across five panoramic benchmarks and generalizes robustly, although it has some limitations on reflective materials. By providing coherent, dense geometric cues in challenging 360° environments, the model offers practical benefits for indoor scene understanding, robot navigation, and 3D reconstruction.

Abstract

Geometric estimation is required for scene understanding and analysis in panoramic 360° images. Current methods usually predict a single feature, such as depth or surface normal. These methods can lack robustness, especially when dealing with intricate textures or complex object surfaces. We introduce a novel multi-task learning (MTL) network that simultaneously estimates depth and surface normals from 360° images. Our first innovation is our MTL architecture, which enhances predictions for both tasks by integrating geometric information from depth and surface normal estimation, enabling a deeper understanding of 3D scene structure. Another innovation is our fusion module, which bridges the two tasks, allowing the network to learn shared representations that improve accuracy and robustness. Experimental results demonstrate that our MTL architecture significantly outperforms state-of-the-art methods in both depth and surface normal estimation, showing superior performance in complex and diverse scenes. Our model's effectiveness and generalizability, particularly in handling intricate surface textures, establish it as a new benchmark in 360° image geometric estimation. The code and model are available at https://github.com/huangkun101230/360MTLGeometricEstimation.

Paper Structure

This paper contains 29 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Our MTL model provides more accurate geometric estimations for 360° images than other methods, particularly in the regions highlighted by red rectangles. The results are visualized as 3D point clouds, with both RGB data and color-coded surface normal maps.
  • Figure 2: Our network architecture. Our network consists of two branches: $B_{depth}$ (in blue) and $B_{normal}$ (in red), dedicated to depth and surface normal estimation, respectively. A fusion module (in green) is employed to fuse the feature maps between each encoder level of $B_{depth}$ and $B_{normal}$ and feed the fused features into the next encoder level. The fused features are also concatenated with the original depth or normal features and fed to the corresponding decoder blocks. The final depth and normal maps are predicted in a multi-scale manner.
  • Figure 3: Our proposed fusion module for fusing 360° depth and surface normal features.
  • Figure 4: Qualitative 360° depth comparisons on diverse datasets, including 3D60, Stanford2D3D, Matterport3D, SunCG, and Structured3D. The areas outlined in red highlight regions where our approach notably enhances object boundaries, providing a more accurate representation of the overall scene geometry.
  • Figure 5: Qualitative 360° surface normal comparisons among HyperSphere, ASNGeo, the adapted UniFuse, PanoFormer, OmniFusion and our method. Best viewed in color.
  • ...and 2 more figures
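
Figure 1 visualizes the predicted geometry as 3D point clouds. As a rough illustration of how an ERP depth map can be lifted to such a point cloud, the following minimal sketch maps each pixel to a unit direction on the sphere and scales it by the depth value. It is not taken from the authors' code: the function names `erp_pixel_to_direction` and `erp_depth_to_points` are hypothetical, the axis convention (y up, z forward at the image center) is an assumption, and depth is assumed to be the Euclidean distance along each viewing ray.

```python
import math


def erp_pixel_to_direction(u, v, width, height):
    """Unit viewing direction for the center of ERP pixel (u, v).

    Assumed convention (not from the paper): longitude spans [-pi, pi)
    left to right, latitude spans +pi/2 (top) to -pi/2 (bottom),
    y points up and z points toward the image center.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def erp_depth_to_points(depth):
    """Lift an ERP depth map (list of rows) to a list of 3D points.

    Each depth value is treated as the distance along the viewing ray,
    so a point is simply depth * unit direction.
    """
    height = len(depth)
    width = len(depth[0])
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            dx, dy, dz = erp_pixel_to_direction(u, v, width, height)
            points.append((d * dx, d * dy, d * dz))
    return points


if __name__ == "__main__":
    # A tiny 4x8 panorama at constant distance 2.0: every point
    # should lie on a sphere of radius 2.0 around the camera.
    toy_depth = [[2.0] * 8 for _ in range(4)]
    cloud = erp_depth_to_points(toy_depth)
    print(len(cloud), "points")
```

Under these assumptions, a constant depth map produces points on a sphere around the camera, which is a quick sanity check for the chosen convention. Datasets that store depth along the optical axis rather than along the ray would need a different scaling.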