PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation

Zhenyu Li, Shariq Farooq Bhat, Peter Wonka

TL;DR

PatchRefiner tackles real-domain, high-resolution monocular depth estimation by reframing it as the refinement of a coarse prediction in a tile-based pipeline. It pairs a frozen coarse-depth model with a learnable refiner that outputs a depth residual. Training follows a teacher–student setup: a teacher trained on synthetic data supplies sharp details through pseudo labels, while a Detail and Scale Disentangling loss, $\mathcal{L}_{DSD}$, preserves real-domain scale. The key contributions are the residual refinement architecture and the $\mathcal{L}_{DSD}$ loss, which blends $\mathcal{L}_{silog}$, $\mathcal{L}_{rank}$, and $\mathcal{L}_{ssi}$ to improve boundary fidelity while maintaining scale accuracy. Empirically, the method reduces RMSE by 18.1% and REL by 15.7% on UnrealStereo4K, and it improves both boundary quality and scale accuracy on CityScapes, ScanNet++, and ETH3D, indicating strong synthetic-to-real generalization for high-resolution depth maps with sharp edges. The result is accurate, edge-preserving depth with reliable scale in real-world scenes, which is relevant to applications such as autonomous driving and 3D reconstruction.
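
The residual formulation described above can be illustrated with a minimal sketch: the frozen model predicts a coarse depth map once, and a refiner adds a residual tile by tile. Everything named here is an assumption for illustration, not the authors' code: `tile_bounds`, `refine_by_tiles`, the non-overlapping 2×2 tiling, and the toy stand-in models are hypothetical, and plain Python lists stand in for tensors.

```python
def tile_bounds(length, n_tiles):
    """Split range(length) into n_tiles contiguous (start, stop) pairs."""
    edges = [round(k * length / n_tiles) for k in range(n_tiles + 1)]
    return list(zip(edges[:-1], edges[1:]))


def refine_by_tiles(image, coarse_model, refiner, n_tiles=2):
    """Return the coarse depth plus a residual predicted separately per tile."""
    coarse = coarse_model(image)              # frozen model: full-image coarse depth
    refined = [row[:] for row in coarse]      # copy so the coarse map stays untouched
    for y0, y1 in tile_bounds(len(image), n_tiles):
        for x0, x1 in tile_bounds(len(image[0]), n_tiles):
            img_tile = [row[x0:x1] for row in image[y0:y1]]
            coarse_tile = [row[x0:x1] for row in coarse[y0:y1]]
            residual = refiner(img_tile, coarse_tile)  # learnable part: depth residual
            for dy, res_row in enumerate(residual):
                for dx, r in enumerate(res_row):
                    refined[y0 + dy][x0 + dx] = coarse_tile[dy][dx] + r
    return refined


# Toy stand-ins (hypothetical): a flat coarse map and an intensity-driven residual.
def toy_coarse(image):
    return [[5.0 for _ in row] for row in image]


def toy_refiner(img_tile, coarse_tile):
    return [[0.1 * (p - 0.5) for p in row] for row in img_tile]


image = [[(x + y) % 2 for x in range(6)] for y in range(4)]  # grayscale checkerboard
depth = refine_by_tiles(image, toy_coarse, toy_refiner)
```

Because the refiner only predicts a correction, a residual of zero leaves the coarse prediction, and therefore its scale, unchanged. This is one plausible reading of why the residual design suits a frozen coarse model.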

Abstract

This paper introduces PatchRefiner, an advanced framework for metric single-image depth estimation aimed at high-resolution real-domain inputs. While depth estimation is crucial for applications such as autonomous driving, 3D generative modeling, and 3D reconstruction, achieving accurate high-resolution depth in real-world scenarios is challenging due to the constraints of existing architectures and the scarcity of detailed real-world depth data. PatchRefiner adopts a tile-based methodology, reconceptualizing high-resolution depth estimation as a refinement process, which results in notable performance enhancements. Utilizing a pseudo-labeling strategy that leverages synthetic data, PatchRefiner incorporates a Detail and Scale Disentangling (DSD) loss to enhance detail capture while maintaining scale accuracy, thus facilitating the effective transfer of knowledge from synthetic to real-world data. Our extensive evaluations demonstrate PatchRefiner's superior performance, significantly outperforming existing benchmarks on the UnrealStereo4K dataset by 18.1% in terms of the root mean squared error (RMSE) and showing marked improvements in detail accuracy and consistent scale estimation on diverse real-world datasets like CityScapes, ScanNet++, and ETH3D.
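
To make the idea of disentangling detail from scale concrete, here is a hedged sketch of a two-term loss: a scale-invariant log term on the sparse ground truth, and a scale-and-shift-invariant term on the teacher's pseudo labels. It is an illustration under stated assumptions, not the paper's exact $\mathcal{L}_{DSD}$: the function names, the `lam=0.85` variance weight, the `weight=1.0` blend factor, and the least-squares alignment are all hypothetical choices, and the ranking term $\mathcal{L}_{rank}$ is omitted for brevity. Depth maps are flattened to plain lists.

```python
import math


def silog_loss(pred, gt, lam=0.85):
    """Scale-invariant log loss over pixels with valid ground truth (gt > 0)."""
    g = [math.log(p) - math.log(t) for p, t in zip(pred, gt) if t > 0]
    if not g:                                  # no valid ground-truth pixels
        return 0.0
    mean = sum(g) / len(g)
    mean_sq = sum(v * v for v in g) / len(g)
    return math.sqrt(max(mean_sq - lam * mean * mean, 0.0))


def ssi_loss(pred, pseudo):
    """Mean absolute error after least-squares scale/shift alignment to pseudo."""
    n = len(pred)
    mp, mq = sum(pred) / n, sum(pseudo) / n
    var = sum((p - mp) ** 2 for p in pred)
    cov = sum((p - mp) * (q - mq) for p, q in zip(pred, pseudo))
    scale = cov / var if var > 0 else 0.0      # closed-form least-squares scale
    shift = mq - scale * mp                    # closed-form least-squares shift
    return sum(abs(scale * p + shift - q) for p, q in zip(pred, pseudo)) / n


def dsd_loss(pred, gt, pseudo, weight=1.0):
    """Ground truth supervises scale; pseudo labels supervise structure only."""
    return silog_loss(pred, gt) + weight * ssi_loss(pred, pseudo)


pred = [2.0, 4.0, 6.0, 8.0]
gt = [2.0, 0.0, 6.0, 0.0]        # 0 marks pixels without ground truth
pseudo = [1.0, 2.0, 3.0, 4.0]    # teacher output: right structure, wrong scale
loss = dsd_loss(pred, gt, pseudo)
```

In this toy case the pseudo labels are off by a factor of two, yet the loss is essentially zero: the alignment step removes the teacher's scale error, so only the ground-truth term constrains metric scale.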

Paper Structure

This paper contains 19 sections, 10 equations, 12 figures, and 6 tables.

Figures (12)

  • Figure 1: Framework Comparison. (a) Low-resolution depth estimation framework with a single forward pass. (b) Fusion-based high-resolution framework combining the best of coarse and fine depth predictions [li2023patchfusion, miangoleh2021boostingdepth]. (c) Our refiner-based framework predicts a residual to refine the coarse prediction.
  • Figure 2: Architecture Illustration. PatchRefiner contains a pre-trained, frozen coarse depth estimation model $\mathcal{N}_c$ and a refiner model $\mathcal{N}_r$ that predicts a residual depth map $\mathcal{D}_r$ to refine the coarse depth $\mathcal{D}_c$. The refiner contains one base depth model $\mathcal{N}_d$, which has the same architecture as $\mathcal{N}_c$, and a lightweight decoder that aggregates information and makes the final prediction.
  • Figure 3: Visualization of Real-Domain Data Pairs. Points lacking ground-truth data are depicted in gray. Due to sparse annotations near edges, models trained on real-domain data exhibit blurred boundary estimations.
  • Figure 4: Enhancing Real-Domain Learning with Synthetic Data. A teacher model trained on synthetic data produces pseudo labels for real-domain training. The student model benefits from a DSD dual-supervision approach: loss on pseudo labels for detail enhancement and loss on ground truth for scale accuracy. This method ensures detailed depth perception without compromising scale accuracy.
  • Figure 5: Qualitative Comparison on UnrealStereo4K. We show the depth prediction and the corresponding error map for each method. Our framework outperforms its counterparts [bhat2023zoedepth, li2023patchfusion], with sharper edges and lower error around boundaries. We show individual patches in all images to emphasize details near depth boundaries.
  • ...and 7 more figures