Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation

Ning-Hsu Wang, Yu-Lun Liu

TL;DR

This work proposes a new depth estimation framework that utilizes unlabeled 360-degree data effectively and uses state-of-the-art perspective depth estimation models as teacher models to generate pseudo labels through a six-face cube projection technique, enabling efficient labeling of depth in 360-degree images.

Abstract

Accurately estimating depth in 360-degree imagery is crucial for virtual reality, autonomous navigation, and immersive media applications. Existing depth estimation methods designed for perspective-view imagery fail when applied to 360-degree images because of the different camera projections and distortions, whereas 360-degree methods perform worse because labeled data pairs are scarce. We propose a new depth estimation framework that utilizes unlabeled 360-degree data effectively. Our approach uses state-of-the-art perspective depth estimation models as teacher models to generate pseudo labels through a six-face cube projection technique, enabling efficient labeling of depth in 360-degree images. This method leverages the increasing availability of large datasets. Our approach includes two main stages: offline mask generation for invalid regions and an online semi-supervised joint training regime. We tested our approach on benchmark datasets such as Matterport3D and Stanford2D3D, showing significant improvements in depth estimation accuracy, particularly in zero-shot scenarios. Our proposed training pipeline can enhance any 360 monocular depth estimator and demonstrates effective knowledge transfer across different camera projections and data types. See our project page for results: https://albert100121.github.io/Depth-Anywhere/
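
The six-face cube projection mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the face layout `FACES`, the axis convention (x right, y up, z forward), the nearest-neighbour sampling, and the function names `cube_face_from_equirect` and `equirect_to_cube` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical face layout (not from the paper): (forward, right, up) axes,
# with x pointing right, y pointing up, and z pointing forward.
FACES = {
    "front": ((0, 0, 1), (1, 0, 0), (0, 1, 0)),
    "right": ((1, 0, 0), (0, 0, -1), (0, 1, 0)),
    "back": ((0, 0, -1), (-1, 0, 0), (0, 1, 0)),
    "left": ((-1, 0, 0), (0, 0, 1), (0, 1, 0)),
    "up": ((0, 1, 0), (1, 0, 0), (0, 0, -1)),
    "down": ((0, -1, 0), (1, 0, 0), (0, 0, 1)),
}


def cube_face_from_equirect(equi, face, size):
    """Sample one size x size perspective face (90-degree field of view)
    from an equirectangular image of shape (H, W, C), nearest-neighbour."""
    h, w = equi.shape[:2]
    forward, right, up = (np.array(a, dtype=np.float64) for a in FACES[face])
    # Pixel centres mapped to [-1, 1]; v is negated so that row 0 is the top.
    t = (np.arange(size) + 0.5) / size * 2.0 - 1.0
    u, v = np.meshgrid(t, -t)
    # Ray direction through every face pixel.
    d = forward + u[..., None] * right + v[..., None] * up
    lon = np.arctan2(d[..., 0], d[..., 2])                        # [-pi, pi]
    lat = np.arctan2(d[..., 1], np.hypot(d[..., 0], d[..., 2]))   # [-pi/2, pi/2]
    # Longitude/latitude to equirectangular pixel indices.
    col = np.floor((lon / (2 * np.pi) + 0.5) * w).astype(int) % w
    row = np.clip(np.floor((0.5 - lat / np.pi) * h).astype(int), 0, h - 1)
    return equi[row, col]


def equirect_to_cube(equi, size):
    """Return a dict mapping face name to its sampled perspective image."""
    return {name: cube_face_from_equirect(equi, name, size) for name in FACES}
```

In a pipeline like the one described, each of the six faces would then be passed through the perspective-view teacher model to obtain pseudo depth labels; that step is omitted here because it depends on the specific teacher used.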

Paper Structure (30 sections, 8 equations, 12 figures, 8 tables)

This paper contains 30 sections, 8 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Our proposed training pipeline improves existing 360 monocular depth estimators. This figure demonstrates the improvement from our proposed training pipeline, tested on the Stanford2D3D dataset in a zero-shot setting.
  • Figure 2: Training Pipeline. Our proposed training pipeline involves joint training on both labeled 360 data with ground truth and unlabeled 360 data. (a) For labeled data, we train our 360 depth model with the loss between depth prediction and ground truth. (b) For unlabeled data, we propose to distill knowledge from a pre-trained perspective-view monocular depth estimator. In this paper, we use Depth Anything to generate pseudo ground truth for training. However, more advanced techniques could be applied. These perspective-view monocular depth estimators fail to produce reasonable equirectangular depth because a domain gap exists. Therefore, we distill knowledge by projecting the image onto six perspective cube faces and passing them through the perspective-view monocular depth estimator. To ensure stable and effective training, we propose generating a valid pixel mask with Segment Anything and applying it when calculating the loss. (c) Furthermore, we apply a random rotation to the RGB image before passing it into Depth Anything, as well as to the predictions from the 360 depth model.
  • Figure 3: Valid Pixel Masking. We used Grounded-Segment-Anything to mask out invalid pixels based on two text prompts: "sky" and "watermark." These regions lack depth sensor ground truth labels in all previous datasets. Unlike Depth Anything, which sets sky regions to 0 disparity, we follow ground truth training and ignore these regions during training for two reasons: (1) segmentation may misclassify and set other regions to zero, leading to noisy labeling, and (2) watermarks are post-processing regions that lack geometrical meaning.
  • Figure 4: Qualitative visualization of a model trained directly on pseudo equirectangular data without scale alignment. We propose calculating the loss with pseudo ground truth on cube faces due to scale misalignment between the six faces during the cube-to-equirectangular projection. We showcase the results of a model trained on pseudo equirectangular data without scale alignment as a simple baseline to demonstrate the importance of calculating loss separately on each of the six faces. The images are presented from top to bottom as follows: (a) RGB images. (b) Pseudo cube ground truth projected directly to equirectangular. (c) Prediction of a model trained with row 2. (d) Pseudo cube ground truth with rotation projected directly to equirectangular. (e) Prediction of a model trained with row 4. (f) Prediction of our model, trained on cube faces separately with rotation.
  • Figure 5: Cube Artifact. Shown in the center row of the figure, an undesired cube artifact appears when we apply joint training with pseudo ground truth from Depth Anything directly. This issue arises from independent relative distances within each cube face caused by a static point of view. Ignoring cross-cube relationships results in poor knowledge distillation. To address this, as shown in Figure 2(c), we randomly rotate the RGB image before inputting it into Depth Anything. This enables better distillation of depth information from varying perspectives within the equirectangular image.
  • ...and 7 more figures
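
The captions for Figures 2 to 4 describe computing the pseudo-label loss separately on each cube face and only over valid pixels. A minimal sketch of that idea follows. The captions do not give the exact loss, so this assumes an affine-invariant (scale-and-shift normalized) L1 loss of the kind commonly used for relative depth; the names `normalize` and `per_face_masked_loss` are hypothetical.

```python
import numpy as np


def normalize(d, mask):
    """Remove shift (median) and scale (mean absolute deviation), using
    statistics from valid pixels only. Assumed normalization, not confirmed
    by the source."""
    valid = d[mask]
    shift = np.median(valid)
    scale = np.mean(np.abs(valid - shift))
    return (d - shift) / max(scale, 1e-6)


def per_face_masked_loss(pred_faces, pseudo_faces, valid_masks):
    """Average the normalized L1 error over faces, one face at a time.

    Each face is normalized independently, so a different scale or shift in
    the teacher's output for each face does not leak into the loss. Pixels
    outside the valid mask (for example sky or watermark regions) are ignored.
    """
    losses = []
    for pred, pseudo, mask in zip(pred_faces, pseudo_faces, valid_masks):
        if not mask.any():
            continue  # nothing valid on this face
        diff = np.abs(normalize(pred, mask) - normalize(pseudo, mask))
        losses.append(diff[mask].mean())
    return float(np.mean(losses)) if losses else 0.0
```

Because each face is normalized on its own, pseudo labels that differ from the prediction only by a per-face positive scale and shift give a loss of approximately zero, which is the property that makes per-face supervision robust to the scale misalignment noted in Figure 4.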