PanoNormal: Monocular Indoor 360° Surface Normal Estimation

Kun Huang, Fanglue Zhang, Neil Dodgson

TL;DR

The paper tackles monocular 360° surface normal estimation under spherical distortions in equirectangular imagery. It introduces PanoNormal, a CNN–ViT hybrid with distortion-aware tangent sampling and a multi-level decoder to jointly capture local geometry and global context, optimized by a balanced multi-term loss. Across five public 360° indoor datasets, it achieves state-of-the-art results and strong generalization to real-world data, outperforming adapted depth-based and perspective ViT approaches. The work demonstrates the importance of task-specific architectural design for 360° normals and provides a solid, adaptable baseline for panoramic scene understanding in practical applications.

Abstract

The presence of spherical distortion in equirectangular projection (ERP) images presents a persistent challenge in dense regression tasks such as surface normal estimation. Although it may appear straightforward to repurpose architectures developed for 360° depth estimation, our empirical findings indicate that such models yield suboptimal performance when applied to surface normal prediction. This is largely attributed to their architectural bias toward capturing global scene layout, which comes at the expense of the fine-grained local geometric cues that are critical for accurate surface orientation estimation. While convolutional neural networks (CNNs) have been employed to mitigate spherical distortion, their fixed receptive fields limit their ability to capture holistic scene structure. Conversely, vision transformers (ViTs) are capable of modeling long-range dependencies via global self-attention, but often fail to preserve high-frequency local detail. To address these limitations, we propose PanoNormal, a monocular surface normal estimation architecture for 360° images that integrates the complementary strengths of CNNs and ViTs. In particular, we design a multi-level global self-attention mechanism that explicitly accounts for the spherical feature distribution, enabling our model to recover both global contextual structure and local geometric details. Experimental results demonstrate that our method not only achieves state-of-the-art performance on several benchmark 360° datasets, but also significantly outperforms adapted depth estimation models on the task of surface normal prediction. The code and model are available at https://github.com/huangkun101230/PanoNormal.
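To make the spherical distortion mentioned in the abstract concrete, the following is a minimal sketch of how ERP pixels relate to directions on the unit sphere and how strongly each image row is stretched horizontally relative to the equator. The function names, the axis convention (y up, z forward), and the half-pixel offset are illustrative assumptions and are not taken from the PanoNormal code.

```python
import numpy as np

def erp_pixel_to_direction(u, v, width, height):
    """Map ERP pixel coordinates (u, v) to unit direction vectors.

    Assumed convention (hypothetical, for illustration only):
    longitude grows to the right, latitude grows upward,
    y is the up axis and z points at the image centre.
    """
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi    # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi   # latitude in (-pi/2, pi/2)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)

def erp_row_stretch(height):
    """Horizontal stretch of each ERP row relative to the equator.

    A circle of latitude has circumference proportional to cos(latitude),
    yet every ERP row has the same pixel width, so the stretch is 1 / cos(latitude).
    """
    v = np.arange(height)
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi
    return 1.0 / np.cos(lat)

# Usage: rows near the poles are stretched far more than rows near the equator.
stretch = erp_row_stretch(512)
equator_stretch, pole_stretch = stretch[256], stretch[0]
```

The stretch factor grows without bound toward the poles, which is one way to see why a fixed convolution kernel covers very different solid angles in different rows of an ERP image.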

Paper Structure (22 sections, 7 equations, 7 figures, 5 tables)

This paper contains 22 sections, 7 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Our PanoNormal method produces more accurate surface normal predictions than the current state-of-the-art method, particularly in the areas highlighted by the red rectangle. For better visualization, we provide a 3D point cloud generated from the ground truth depth.
  • Figure 2: Top: the overall architecture of the proposed PanoNormal method. Bottom: the key components: (a) The distortion-aware sampling process on the tangent patch, its transformation to the target ERP domain, and the application of a self-attention scheme among the tokens within each patch. A learnable token flow facilitates attention among the patches. (b) The proposed hierarchical multi-level transformer decoder, which produces results in different scales for comprehensive learning.
  • Figure 3: Qualitative comparisons across five benchmarks, featuring PanoNormal, UniFuse, PanoFormer, OmniFusion, MonoViT, HyperSphere, and 360MTL. Best viewed in color.
  • Figure 4: Surface normal predictions on real-world data. The images are from the SUN360 dataset [xiao2012recognizing]. More qualitative results can be found in our supplementary materials.
  • Figure 5: Comparison of surface normal predictions on Stanford2D3D data between ASNGeo [long2024adaptive] (a perspective-based method) and our approach. To further validate the conversion process, we used the depth-to-surface-normal technique from ASNGeo [long2024adaptive] to convert both perspective depth maps and GLPanoDepth [bai2024glpanodepth] (360° domain) depth maps into surface normals.
  • ...and 2 more figures
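
The Figure 2 caption describes sampling on a tangent patch and transforming it to the ERP domain. As a rough illustration of that idea, the sketch below uses the standard gnomonic (tangent-plane) projection to find which ERP pixel coordinates a square tangent patch covers. It is a generic sketch under stated assumptions, not the paper's sampling scheme: the function name, its parameters, and the axis convention (y up, z toward the image centre) are all hypothetical.

```python
import numpy as np

def tangent_patch_erp_coords(lon0, lat0, fov, patch_size, width, height):
    """ERP pixel coordinates sampled by a tangent patch (gnomonic projection).

    lon0, lat0 : patch centre on the sphere, in radians
    fov        : angular field of view of the square patch, in radians
    Returns (u, v), each of shape (patch_size, patch_size).
    All names and conventions here are illustrative assumptions.
    """
    # Regular grid on the tangent plane; the first row is the top of the patch.
    t = np.tan(fov / 2.0)
    xs = np.linspace(-t, t, patch_size)
    ys = np.linspace(t, -t, patch_size)
    x, y = np.meshgrid(xs, ys)

    # Orthonormal frame at the patch centre: centre direction, east, north.
    d0 = np.array([np.cos(lat0) * np.sin(lon0), np.sin(lat0), np.cos(lat0) * np.cos(lon0)])
    east = np.array([np.cos(lon0), 0.0, -np.sin(lon0)])
    north = np.array([-np.sin(lat0) * np.sin(lon0), np.cos(lat0), -np.sin(lat0) * np.cos(lon0)])

    # Lift the plane points into 3D and project them back onto the unit sphere.
    p = d0 + x[..., None] * east + y[..., None] * north
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)

    # Convert sphere directions to longitude/latitude, then to ERP pixel coordinates.
    lat = np.arcsin(np.clip(p[..., 1], -1.0, 1.0))
    lon = np.arctan2(p[..., 0], p[..., 2])
    u = np.mod((lon + np.pi) / (2.0 * np.pi) * width - 0.5, width)
    v = (np.pi / 2.0 - lat) / np.pi * height - 0.5
    return u, v

# Usage: the same patch covers a wider span of ERP columns at high latitude.
u_eq, v_eq = tangent_patch_erp_coords(0.0, 0.0, np.radians(30), 5, 1024, 512)
u_hi, v_hi = tangent_patch_erp_coords(0.0, np.radians(60), np.radians(30), 5, 1024, 512)
```

The returned coordinates are fractional, so in practice they would be fed to an interpolating sampler; that step is omitted here to keep the sketch short.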