PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation
Zhenyu Li, Shariq Farooq Bhat, Peter Wonka
TL;DR
PatchRefiner addresses real-domain, high-resolution monocular metric depth estimation by reframing it as the refinement of a coarse prediction within a tile-based pipeline. A frozen coarse-depth model supplies the initial estimate, and a learnable refiner predicts a depth residual that is added to it. Training uses a teacher-student setup: a teacher trained on synthetic data provides sharp-detail pseudo labels, while a Detail and Scale Disentangling loss, $\mathcal{L}_{DSD}$, preserves the real-domain scale. The two key contributions are the residual refinement architecture and the $\mathcal{L}_{DSD}$ loss, which combines $\mathcal{L}_{silog}$, $\mathcal{L}_{rank}$, and $\mathcal{L}_{ssi}$ to improve boundary fidelity without sacrificing scale accuracy. Empirically, the method reduces RMSE by 18.1% and REL by 15.7% on UnrealStereo4K, and improves both boundary quality and scale consistency on CityScapes, ScanNet++, and ETH3D, which indicates that detail learned from synthetic data transfers to real high-resolution inputs. These results are relevant to applications such as autonomous driving and 3D reconstruction, which need metric depth maps with sharp edges and reliable scale.
Abstract
This paper introduces PatchRefiner, a framework for metric single-image depth estimation aimed at high-resolution real-domain inputs. While depth estimation is crucial for applications such as autonomous driving, 3D generative modeling, and 3D reconstruction, achieving accurate high-resolution depth in real-world scenarios is challenging due to the constraints of existing architectures and the scarcity of detailed real-world depth data. PatchRefiner adopts a tile-based methodology and recasts high-resolution depth estimation as a refinement process, which yields notable performance gains. Using a pseudo-labeling strategy that leverages synthetic data, PatchRefiner incorporates a Detail and Scale Disentangling (DSD) loss to enhance detail capture while maintaining scale accuracy, thus enabling effective transfer of knowledge from synthetic to real-world data. Our extensive evaluations show that PatchRefiner significantly outperforms existing methods on the UnrealStereo4K dataset, by 18.1% in terms of root mean squared error (RMSE), and delivers marked improvements in detail accuracy and consistent scale estimation on diverse real-world datasets such as CityScapes, ScanNet++, and ETH3D.
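The residual refinement and the loss terms named in the TL;DR can be illustrated with a small sketch. It is not the authors' implementation: the function names, the weight `w_ssi`, the silog variance weight `lam=0.85`, and the use of flat Python lists in place of image tensors are all assumptions made for illustration, and the ranking term $\mathcal{L}_{rank}$ is left out for brevity. The sketch only shows the general idea of supervising scale with real ground truth while supervising detail with a scale-and-shift-aligned pseudo label.

```python
import math

def refine(coarse, residual):
    """Residual refinement: final depth = frozen coarse depth + predicted residual."""
    return [c + r for c, r in zip(coarse, residual)]

def silog_loss(pred, target, lam=0.85):
    """Scale-invariant log loss on positive depths (lam is an assumed value)."""
    g = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(g)
    mean_sq = sum(x * x for x in g) / n
    mean = sum(g) / n
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_sq - lam * mean * mean, 0.0))

def ssi_loss(pred, pseudo):
    """Mean absolute error after least-squares scale/shift alignment of pred to pseudo."""
    n = len(pred)
    mp = sum(pred) / n
    mq = sum(pseudo) / n
    var = sum((p - mp) ** 2 for p in pred)
    cov = sum((p - mp) * (q - mq) for p, q in zip(pred, pseudo))
    scale = cov / var if var > 0 else 0.0
    shift = mq - scale * mp
    return sum(abs(scale * p + shift - q) for p, q in zip(pred, pseudo)) / n

def dsd_like_loss(pred, real_gt, pseudo, w_ssi=1.0):
    """Hypothetical combination: metric scale from real ground truth,
    detail from the teacher's pseudo label (ranking term omitted)."""
    return silog_loss(pred, real_gt) + w_ssi * ssi_loss(pred, pseudo)

# Toy usage on a three-pixel "tile".
coarse = [2.0, 4.0, 8.0]
residual = [0.1, -0.2, 0.3]
pred = refine(coarse, residual)
loss = dsd_like_loss(pred, real_gt=[2.0, 4.0, 8.0], pseudo=[1.0, 1.9, 4.2])
```

Because `ssi_loss` aligns the prediction to the pseudo label before comparing, any global scale or offset error in the teacher's output does not leak into the student. That is the sense in which detail and scale supervision are kept separate in this sketch.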
