Estimating Depth of Monocular Panoramic Image with Teacher-Student Model Fusing Equirectangular and Spherical Representations

Jingguo Liu, Yijun Xu, Shigang Li, Jianfeng Li

TL;DR

This work tackles monocular depth estimation for 360° panoramas by addressing ERP distortions through fusion with a spherical representation. It introduces a spherical convolution operating directly on the sphere, a Segmentation Feature Fusion module, and a teacher–student framework that distills latent depth cues from ground-truth during training. The method achieves state-of-the-art results on Matterport3D and Stanford2D3D and remains competitive on 3D60, demonstrating robustness across indoor scenes. The combination of spherical-domain feature extraction, ERP–spherical fusion, and knowledge distillation provides a practical approach to improve depth perception for omnidirectional imagery in robotics and surveillance.

Abstract

Disconnectivity and distortion are the two problems that must be handled when processing 360-degree equirectangular images. In this paper, we propose a method for estimating the depth of a monocular panoramic image with a teacher-student model that fuses equirectangular and spherical representations. In contrast with existing methods that fuse an equirectangular representation with a cube-map or tangent representation, a spherical representation is a better choice because sampling on a sphere is more uniform and copes with distortion more effectively. In this method, a novel spherical convolution kernel, computed from sampling points on a sphere, is developed to extract features from the spherical representation, and a Segmentation Feature Fusion (SFF) methodology is then used to combine these features with those extracted from the equirectangular representation. In contrast with existing methods that use a teacher-student model to obtain a lighter depth estimation model, we use a teacher-student model to learn the latent features of depth images. The result is a trained model that estimates the depth map of an equirectangular image using not only the feature maps extracted from the input equirectangular image but also the distilled knowledge learnt from the ground-truth depth maps of the training set. In experiments, the proposed method is tested on several well-known 360-degree monocular depth estimation benchmark datasets and outperforms existing methods on most evaluation metrics.
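The claim that equirectangular (ERP) sampling is non-uniform can be checked with a short calculation. The sketch below is not taken from the paper; it only assumes the standard ERP layout, in which rows are evenly spaced in latitude and columns evenly spaced in longitude. The function name and the 256 x 512 image size are illustrative choices.

```python
import math

def erp_pixel_solid_angle(row, height, width):
    """Solid angle (in steradians) covered by one pixel of the given ERP row."""
    # Row 0 is the top of the image; latitude runs from +pi/2 down to -pi/2.
    lat_top = math.pi / 2 - math.pi * row / height
    lat_bottom = math.pi / 2 - math.pi * (row + 1) / height
    d_lon = 2 * math.pi / width
    # Area of a latitude-longitude patch on the unit sphere.
    return d_lon * (math.sin(lat_top) - math.sin(lat_bottom))

height, width = 256, 512  # illustrative image size
total = sum(erp_pixel_solid_angle(r, height, width) * width for r in range(height))
equator = erp_pixel_solid_angle(height // 2, height, width)
pole = erp_pixel_solid_angle(0, height, width)

print(f"sum over all pixels / (4*pi): {total / (4 * math.pi):.6f}")
print(f"equator pixel / pole pixel:   {equator / pole:.1f}")
```

The pixels together cover the whole sphere, but a pixel in the top row covers far less of it than a pixel at the equator, so a fixed square kernel on the ERP grid sees very different angular footprints at different latitudes. This is the distortion that a kernel defined by sampling points on the sphere is meant to avoid.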

Paper Structure

This paper contains 20 sections, 7 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Overview of our network
  • Figure 2: (a) Generation process of the spherical convolution kernel. With a defined universal rotation matrix, spherical convolution kernels corresponding to different positions can be generated, which greatly reduces the computational cost. (b) Visualization of the convolution kernels at polar and equatorial positions in ERP images; the position-dependent kernel shapes enable us to tackle distortion in distinct regions.
  • Figure 3: Rotation process
  • Figure 4: The relative position of the spherical convolution kernel for each pixel in the image is stored in the corresponding lookup tables (LUTs), which in turn map the ERP image to nine sub-images. Then, after group convolution and pixel-wise convolution, an $R^{N_{1}\times H\times W}$ feature map is obtained.
  • Figure 5: Results of qualitative comparison on 3D60 (top), Matterport3D (middle) and Stanford2D3D (bottom).
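
The kernel generation and lookup-table steps described in the captions of Figures 2 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a 3x3 kernel defined on the tangent plane at latitude 0, longitude 0, a rotation composed of a tilt about the y-axis followed by a spin about the z-axis, and nearest-neighbour sampling; the function names (`kernel_directions`, `build_lut`, `gather_sub_images`) and the `step` parameter are hypothetical.

```python
import math

def kernel_directions(lat, lon, step):
    """Unit vectors of a 3x3 kernel centred at (lat, lon), angles in radians."""
    # Assumed reference kernel: a 3x3 grid on the tangent plane at (1, 0, 0).
    t = math.tan(step)
    points = []
    for j in (1, 0, -1):        # top row first (higher latitude)
        for i in (-1, 0, 1):    # left to right (increasing longitude)
            x, y, z = 1.0, i * t, j * t
            n = math.sqrt(x * x + y * y + z * z)
            points.append((x / n, y / n, z / n))
    # Rotate the reference kernel: tilt to the latitude, then spin to the longitude.
    cl, sl = math.cos(lat), math.sin(lat)
    co, so = math.cos(lon), math.sin(lon)
    rotated = []
    for x, y, z in points:
        x1, y1, z1 = cl * x - sl * z, y, sl * x + cl * z
        rotated.append((co * x1 - so * y1, so * x1 + co * y1, z1))
    return rotated

def build_lut(height, width, step):
    """For every ERP pixel, the (row, col) indices of its nine kernel samples."""
    lut = []
    for r in range(height):
        lat = math.pi / 2 - math.pi * (r + 0.5) / height
        row = []
        for c in range(width):
            lon = 2 * math.pi * (c + 0.5) / width - math.pi
            samples = []
            for x, y, z in kernel_directions(lat, lon, step):
                s_lat = math.asin(max(-1.0, min(1.0, z)))
                s_lon = math.atan2(y, x)  # already wrapped into [-pi, pi]
                sr = int((math.pi / 2 - s_lat) / math.pi * height)
                sc = int((s_lon + math.pi) / (2 * math.pi) * width)
                # Clamp rows at the poles; wrap columns across the left/right seam.
                samples.append((min(max(sr, 0), height - 1), sc % width))
            row.append(samples)
        lut.append(row)
    return lut

def gather_sub_images(image, lut):
    """Nine sub-images; sub-image k holds, per pixel, that pixel's k-th sample."""
    height, width = len(image), len(image[0])
    return [[[image[lut[r][c][k][0]][lut[r][c][k][1]] for c in range(width)]
             for r in range(height)] for k in range(9)]

height, width = 16, 32                      # tiny illustrative image
image = [[r * width + c for c in range(width)] for r in range(height)]
lut = build_lut(height, width, step=math.pi / height)
subs = gather_sub_images(image, lut)

def horizontal_span(r, c):
    """Column distance between the left and right samples of the middle kernel row."""
    return (lut[r][c][5][1] - lut[r][c][3][1]) % width

print("span near the pole:   ", horizontal_span(0, width // 2))
print("span near the equator:", horizontal_span(height // 2, width // 2))
```

Because the table depends only on pixel position, it can be computed once and reused for every image of the same size. In this sketch the kernel footprint spans more ERP columns near the pole than near the equator, which matches the qualitative behaviour described for Figure 2(b). The learned weights (the group and pixel-wise convolutions mentioned in Figure 4) would then be applied to the nine stacked sub-images; that step is omitted here.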