The Devil is in the Edges: Monocular Depth Estimation with Edge-aware Consistency Fusion

Pengzhi Li, Yikang Ding, Haohan Wang, Chengshuai Tang, Zhiheng Li

TL;DR

Monocular depth estimation often misses fine edges; this paper shows edge information is a key cue for high-frequency depth detail. It proposes ECFNet, an edge-aware consistency fusion framework with a hybrid edge detection strategy, a layered fusion module, and a depth consistency module to fuse and refine three initial depth maps derived from the RGB image, an edge map, and an edge-highlighted image. Across three public datasets, ECFNet achieves state-of-the-art results, particularly in edge depth accuracy and robustness to degraded inputs. The approach enables improved edge fidelity and structured depth maps for downstream tasks, including cross-domain image editing with edge-preserving depth.

Abstract

This paper presents a novel monocular depth estimation method, named ECFNet, for estimating high-quality monocular depth with clear edges and valid overall structure from a single RGB image. We thoroughly investigate the key factor that affects the edge depth estimation of monocular depth estimation (MDE) networks and conclude that the edge information itself plays a critical role in predicting depth details. Driven by this analysis, we propose to explicitly employ the image edges as input for ECFNet and fuse the initial depths from different sources to produce the final depth. Specifically, ECFNet first uses a hybrid edge detection strategy to get the edge map and edge-highlighted image from the input image, and then leverages a pre-trained MDE network to infer the initial depths of these three images (the input image, the edge map, and the edge-highlighted image). After that, ECFNet uses a layered fusion module (LFM) to fuse the initial depths, and the fused depth is further updated by a depth consistency module (DCM) to form the final estimation. Extensive experimental results on public datasets and ablation studies indicate that our method achieves state-of-the-art performance. Project page: https://zrealli.github.io/edgedepth.

Paper Structure (23 sections, 5 equations, 19 figures, 4 tables)

Figures (19)

  • Figure 1: Depth visualizations on NYU-v2 nyu2012indoor. For each example in the first row, we showcase the RGB image, the edge-highlighted map, the edge map, and their corresponding depth maps predicted by pre-trained DPT Ranftl2021. The edge maps are obtained using the hybrid edge detection strategy. In the second row, we show more edge observations in degraded images or cross-domain applications.
  • Figure 2: Pipeline of ECFNet. Given an image, ECFNet first extracts the edge map and computes the edge-highlighted image by removing edge pixels. These three images (including the original image) are fed into a frozen MDE network to predict the initial depth maps. Subsequently, the initial depths are fused using LFM. Finally, the DCM is used to reduce the errors in the fused depth from LFM and improve the depth consistency between the final depth and the initial depths.
  • Figure 3: Comparison of different edge maps and their corresponding depth maps. (a) the RGB image, (b) the Sobel edge map sobel1983accuracy, (c) our edge map, and (d) the ground truth edge map. As the quality of the edge map improves, the corresponding depth map can capture more depth details (ignoring structural distortion).
  • Figure 4: Visualized depth results before and after DCM. (a)-(c) represent the depth maps with different resolutions, while (d) displays depth slices of the green regions in the sample sequence. DCM helps significantly reduce the inconsistency between the input depth maps.
  • Figure 5: Visual comparison with base models (DPT Ranftl2021 and LeReS yin2021learning) and related fusion methods (BMD Miangoleh2021Boosting and GDF dai2022multi) on IBims-1 ibim2018evaluation and NYU-v2 nyu2012indoor datasets. "Original" indicates the depth from base models, and "HR" indicates the depth of high-resolution inputs from base models. The left four columns use DPT Ranftl2021 as the base MDE model, and the right three columns use LeReS yin2021learning.
  • ...and 14 more figures
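
The first stage of the pipeline in the Figure 2 caption (edge map, edge-highlighted image, then three passes through a frozen MDE network) can be illustrated with a small sketch. This is not the authors' implementation: the paper's hybrid edge detection strategy is not specified here, so a plain Sobel detector (the baseline shown in Figure 3) stands in for it. The function names, the `threshold` value, the choice to zero out edge pixels, and the caller-supplied `mde` callable are all assumptions made for illustration. LFM and DCM are not sketched.

```python
import numpy as np


def sobel_edge_map(gray, threshold=0.2):
    """Binary edge map from a grayscale image using Sobel gradients.

    Assumption: Sobel stands in for the paper's hybrid edge detector,
    and `threshold` is an arbitrary illustrative value.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = gray.shape
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Accumulate the 3x3 filter responses one kernel tap at a time.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    if peak > 0:
        magnitude = magnitude / peak  # normalize to [0, 1]
    return magnitude > threshold


def edge_highlighted(rgb, edges):
    """Copy of `rgb` with edge pixels removed (assumption: set to zero)."""
    out = rgb.copy()
    out[edges] = 0.0
    return out


def build_inputs(rgb):
    """Return the three images that would be fed to the frozen MDE network."""
    gray = rgb.mean(axis=2)
    edges = sobel_edge_map(gray)
    edge_image = np.repeat(edges[..., None].astype(rgb.dtype), 3, axis=2)
    return rgb, edge_image, edge_highlighted(rgb, edges)


def initial_depths(rgb, mde):
    """Run a caller-supplied depth predictor `mde` (hypothetical) on each input."""
    return [mde(image) for image in build_inputs(rgb)]
```

With a real model, `mde` would wrap a pre-trained network such as DPT or LeReS; the three depth maps returned by `initial_depths` are what the LFM would then fuse.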