360$^\circ$ High-Resolution Depth Estimation via Uncertainty-aware Structural Knowledge Transfer

Zidong Cao, Hao Ai, Athanasios V. Vasilakos, Lin Wang

TL;DR

This work tackles the problem of predicting high-resolution (HR) omnidirectional depth maps from low-resolution omnidirectional image (ODI) inputs without any HR depth ground truth. It introduces a weakly-supervised framework that couples ODI super-resolution (SR) with a dedicated scene structural knowledge transfer (SSKT) module, which uses a Cylindrical Implicit Interpolation Function (CIIF) and a feature distillation loss to transfer structural cues to HR depth estimation. The approach achieves results competitive with fully-supervised methods across multiple datasets and settings, and the ODI SR component adds no inference cost at test time. The solution is modular and adaptable to different backbones and up-sampling factors, making it practical for resource-constrained devices and real-world omnidirectional vision applications.

Abstract

To predict high-resolution (HR) omnidirectional depth maps, existing methods typically take an HR omnidirectional image (ODI) as input and rely on fully-supervised learning. However, in practice, taking an HR ODI as input is undesirable on resource-constrained devices. In addition, depth maps often have lower resolution than color images. Therefore, in this paper, we explore, for the first time, estimating HR omnidirectional depth directly from a low-resolution (LR) ODI when no HR depth ground-truth (GT) map is available. Our key idea is to transfer scene structural knowledge from the HR image modality and the corresponding LR depth maps to achieve HR depth estimation without any extra inference cost. Specifically, we introduce ODI super-resolution (SR) as an auxiliary task and train both tasks collaboratively in a weakly-supervised manner to boost the performance of HR depth estimation. The ODI SR task extracts the scene structural knowledge via uncertainty estimation. Building on this, we propose a scene structural knowledge transfer (SSKT) module with two key components. First, we employ a cylindrical implicit interpolation function (CIIF) to learn cylindrical neural interpolation weights for feature up-sampling, and we share the parameters of the CIIFs between the two tasks. Second, we propose a feature distillation (FD) loss that provides extra structural regularization to help the HR depth estimation task learn more scene structural knowledge. Extensive experiments demonstrate that our weakly-supervised method outperforms baseline methods and even achieves performance comparable to that of fully-supervised methods.
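The abstract says the ODI SR task extracts structural knowledge via uncertainty estimation, but this page does not give the loss. As a rough illustration only, the sketch below uses a common formulation from uncertainty-aware regression (a Laplace negative log-likelihood, up to a constant), which may differ from the paper's actual objective. The function name `uncertainty_l1_loss` and the `log_scale` input are hypothetical.

```python
import math


def uncertainty_l1_loss(pred, target, log_scale):
    """Mean per-pixel Laplace negative log-likelihood, up to a constant.

    ASSUMPTION: this is a generic uncertainty-aware L1 loss, not the
    paper's confirmed formulation.

    pred, target, log_scale: flat lists of floats with equal length.
    log_scale[i] is the predicted log of the Laplace scale at pixel i;
    larger values mean the model is less certain about that pixel.
    """
    if not (len(pred) == len(target) == len(log_scale)):
        raise ValueError("inputs must have the same length")
    if not pred:
        raise ValueError("inputs must not be empty")

    total = 0.0
    for p, t, s in zip(pred, target, log_scale):
        # The residual is down-weighted where uncertainty is high,
        # while the "+ s" term penalises claiming uncertainty everywhere.
        total += abs(p - t) * math.exp(-s) + s
    return total / len(pred)
```

With `log_scale` set to zero everywhere, this reduces to a plain mean L1 loss. For a pixel with a large residual, raising `log_scale` lowers the loss, so under this assumed formulation the predicted uncertainty map would concentrate on hard-to-reconstruct regions.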

Paper Structure (24 sections, 16 equations, 15 figures, 16 tables, 1 algorithm)

Figures (15)

  • Figure 1: HR depth estimation paradigms. (a) Existing fully-supervised methods. (b) Our weakly-supervised method transfers knowledge between two tasks via the SSKT module.
  • Figure 2: Overview of our weakly-supervised framework. First, we introduce an ODI SR task (Sec. \ref{ISR-UL}) that predicts uncertainty to extract structural knowledge. Then, we design an SSKT module with CIIF, which learns cylindrical neural interpolation weights and shares parameters between the two tasks. The SSKT module also includes an FD loss (Sec. \ref{CMKT}) for feature distillation. Finally, we employ an HR depth estimation task (Sec. \ref{DESR}) to generate an HR depth map directly from an LR ODI. The detailed components of CIIF are shown in Fig. 3(b).
  • Figure 3: (a) LIIF predicts the RGB value based on the Cartesian coordinates. (b) Our CIIF learns neural interpolation weights based on the cylindrical coordinates. The parameters of CIIF $g_{\theta}$ are contained in the MLP shared from the ODI SR task.
  • Figure 4: An illustration of the feature distillation (FD) loss.
  • Figure 5: Visual comparison of a fully-supervised method (i.e., UniFuse-Fusion) and ours on the Matterport3D dataset. Best viewed in color.
  • ...and 10 more figures
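
The Figure 3 caption contrasts Cartesian coordinates (LIIF) with cylindrical coordinates (CIIF). One likely motivation is that an equirectangular ODI wraps around horizontally, so its left and right borders are neighbours on a cylinder but far apart in the image plane. The sketch below illustrates only that geometric point. The coordinate convention, the function names, and the distance-based softmax (a stand-in for the learned MLP $g_{\theta}$) are all assumptions, not the paper's implementation.

```python
import math


def erp_to_cylinder(col, row, width, height):
    """Map an equirectangular pixel centre to a point on a unit cylinder.

    ASSUMED convention (not taken from the paper): longitude spans
    [-pi, pi) across the image width, and the cylinder height is the
    row index normalised to [-1, 1].
    """
    lon = ((col + 0.5) / width) * 2.0 * math.pi - math.pi
    h = ((row + 0.5) / height) * 2.0 - 1.0
    return (math.cos(lon), math.sin(lon), h)


def interpolation_weights(query, neighbours, temperature=0.1):
    """Softmax over negative cylinder-space distances.

    This fixed rule is a hypothetical stand-in for learned neural
    interpolation weights; closer neighbours receive larger weights.
    """
    scores = [-math.dist(query, n) / temperature for n in neighbours]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Pixels on the left border, right border, and centre of an 8x4 ODI.
left = erp_to_cylinder(0, 1, 8, 4)
right = erp_to_cylinder(7, 1, 8, 4)
centre = erp_to_cylinder(4, 1, 8, 4)

# On the cylinder the right border is as close to the left border as an
# adjacent column, so it dominates the weights over the image centre.
weights = interpolation_weights(left, [right, centre])
```

In Cartesian image coordinates, columns 0 and 7 would be the farthest apart; on the cylinder they are as close as columns 0 and 1, which is the kind of neighbourhood a cylindrical interpolation function can exploit.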