PatchRefiner V2: Fast and Lightweight Real-Domain High-Resolution Metric Depth Estimation
Zhenyu Li, Wenqing Cui, Shariq Farooq Bhat, Peter Wonka
TL;DR
PatchRefiner V2 (PRV2) targets fast, high-resolution monocular depth estimation by replacing a heavy refiner with lightweight encoders and introducing a bidirectional refinement scheme. The Coarse-to-Fine (C2F) module uses Guided Denoising Units to denoise refiner features and align them with coarse depth cues, while the Fine-to-Coarse (F2C) pathway injects high-frequency detail. The denoising operation follows $M_w = \sigma(\text{CB}(\text{Cat}(f_c,f_s)))$ and $f_d = M_w \otimes f_s$. To enhance synthetic-to-real transfer, SSIGM replaces SSI: ${\mathcal L}_{ssigm} = \frac{1}{M} \sum_{i=1}^{M} (|\nabla_x R_i| + |\nabla_y R_i|)$ with $R_i = \hat{d}_i - \hat{d}_i^*$ and $\hat{d}^*$ derived after scale-shift alignment. A Noisy Pretraining (NP) regime pretrains the refiner with random coarse features drawn from $\mathcal{N}(0,1)$, enabling end-to-end training with a lightweight refiner that maintains strong depth boundary delineation. PRV2 achieves state-of-the-art RMSE on UnrealStereo4K and improved boundary metrics on CityScapes, ScanNet++, and KITTI, with far fewer parameters and faster inference.
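The gating in the denoising formula can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that $f_c$ and $f_s$ are coarse and refiner feature maps of equal spatial size, that $\otimes$ is element-wise multiplication, and that CB can be stood in for by a single 1x1 convolution. The names `guided_denoise`, `weight`, and `bias` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def guided_denoise(f_c, f_s, weight, bias):
    """Gate refiner features f_s with a mask predicted from (f_c, f_s).

    f_c:    (C_c, H, W) coarse features (assumed layout)
    f_s:    (C_s, H, W) refiner features (assumed layout)
    weight: (C_s, C_c + C_s) and bias: (C_s,), a 1x1 convolution that
            stands in for the CB block (an assumption, not the paper's CB)
    """
    cat = np.concatenate([f_c, f_s], axis=0)             # Cat(f_c, f_s)
    logits = np.einsum("oc,chw->ohw", weight, cat)       # 1x1 conv over channels
    logits = logits + bias[:, None, None]
    m_w = sigmoid(logits)                                # M_w, values in (0, 1)
    f_d = m_w * f_s                                      # f_d = M_w (x) f_s
    return f_d, m_w

# Example call with random features and small random weights.
rng = np.random.default_rng(0)
f_c = rng.standard_normal((4, 8, 8))
f_s = rng.standard_normal((6, 8, 8))
weight = 0.1 * rng.standard_normal((6, 10))
bias = np.zeros(6)
f_d, m_w = guided_denoise(f_c, f_s, weight, bias)
```

Because the mask lies strictly between 0 and 1, each gated value has a magnitude no larger than the refiner feature it came from. Under these assumptions the unit can only attenuate refiner features, which matches its role as a denoiser.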
Abstract
While current high-resolution depth estimation methods achieve strong results, they often suffer from computational inefficiency because they rely on heavyweight models and multiple inference steps, which increases inference time. To address this, we introduce PatchRefiner V2 (PRV2), which replaces heavy refiner models with lightweight encoders. This reduces model size and inference time but introduces noisy features. To overcome this, we propose a Coarse-to-Fine (C2F) module with a Guided Denoising Unit that refines and denoises the refiner features, and a Noisy Pretraining strategy that pretrains the refiner branch so that its potential can be fully exploited. Additionally, we introduce a Scale-and-Shift Invariant Gradient Matching (SSIGM) loss to enhance synthetic-to-real domain transfer. PRV2 outperforms state-of-the-art depth estimation methods on UnrealStereo4K in both accuracy and speed, with fewer parameters and faster inference. It also shows improved depth boundary delineation on real-world datasets such as CityScapes, ScanNet++, and KITTI, demonstrating its versatility across domains.
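A scale-and-shift invariant gradient matching loss of the form given in the TL;DR can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: the prediction is aligned to the reference by a closed-form least-squares scale and shift (which side is aligned is an assumption), gradients are forward differences at a single scale, every pixel is treated as valid, and $M$ is taken to be the pixel count. The function names are hypothetical.

```python
import numpy as np

def align_scale_shift(pred, ref):
    """Return s * pred + t, with (s, t) minimizing ||s * pred + t - ref||^2."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return s * pred + t

def ssigm_loss(pred, ref):
    """Mean absolute gradient of the residual after scale-shift alignment.

    pred, ref: (H, W) depth maps. No validity mask is applied (assumption).
    """
    residual = align_scale_shift(pred, ref) - ref        # R
    grad_x = np.abs(np.diff(residual, axis=1))           # |grad_x R|, forward difference
    grad_y = np.abs(np.diff(residual, axis=0))           # |grad_y R|, forward difference
    return (grad_x.sum() + grad_y.sum()) / residual.size # divide by M = H * W

# Example: the loss ignores any affine rescaling of the prediction.
rng = np.random.default_rng(0)
pred = rng.standard_normal((8, 8))
ref = rng.standard_normal((8, 8))
loss_a = ssigm_loss(pred, ref)
loss_b = ssigm_loss(5.0 * pred - 1.0, ref)   # same value up to rounding
loss_c = ssigm_loss(pred, 2.0 * pred + 3.0)  # close to zero: ref is affine in pred
```

The alignment step is what makes the loss usable with pseudo-labels whose scale and shift are unknown: only the structure of the residual, and in particular its edges, contributes to the penalty.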
