PatchRefiner V2: Fast and Lightweight Real-Domain High-Resolution Metric Depth Estimation

Zhenyu Li, Wenqing Cui, Shariq Farooq Bhat, Peter Wonka

TL;DR

PatchRefiner V2 targets fast, high-resolution monocular metric depth estimation by replacing the heavy refiner with lightweight encoders and introducing a bidirectional refinement scheme. The Coarse-to-Fine (C2F) module uses Guided Denoising Units to denoise refiner features and align them with coarse depth cues, while the Fine-to-Coarse (F2C) pathway injects high-frequency detail. The denoising operation is $M_w = \sigma(\text{CB}(\text{Cat}(f_c,f_s)))$ and $f_d = M_w \otimes f_s$, where $f_c$ is the coarse feature, $f_s$ is the shortcut refiner feature, CB is a convolutional block, and $M_w$ is a weight map with values between 0 and 1. To improve synthetic-to-real transfer, the SSIGM loss replaces SSI: ${\mathcal L}_{ssigm} = \frac{1}{M} \sum_{i=1}^{M} (|\nabla_x R_i| + |\nabla_y R_i|)$ with $R_i = \hat{d}_i - \hat{d}_i^*$, where $\hat{d}^*$ is obtained after scale-and-shift alignment. A Noisy Pretraining (NP) regime pretrains the refiner with random coarse features drawn from $\mathcal{N}(0,1)$, enabling end-to-end training with a lightweight refiner that still delineates depth boundaries well. PRV2 achieves state-of-the-art RMSE on UnrealStereo4K and improved boundary metrics on CityScapes, ScanNet++, and KITTI, with far fewer parameters and faster inference.
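The SSIGM loss above can be illustrated with a minimal NumPy sketch. Several details are assumptions rather than the authors' implementation: the reference map is aligned to the prediction by a least-squares scale and shift, the gradients are plain forward differences at a single scale, and $M$ is taken as the total number of pixels. The names `scale_shift_align` and `ssigm_loss` are hypothetical.

```python
import numpy as np

def scale_shift_align(source, target):
    """Return s * source + t, with (s, t) fitted to target by least squares."""
    # Design matrix [source, 1] so that the solution is (scale, shift).
    A = np.stack([source.ravel(), np.ones(source.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target.ravel(), rcond=None)
    return s * source + t

def ssigm_loss(pred, ref):
    """Gradient-matching loss on the residual after scale-and-shift alignment.

    Assumption: `ref` is aligned to `pred`; the source only states that the
    reference is obtained after scale-and-shift alignment.
    """
    residual = pred - scale_shift_align(ref, pred)   # R = d - d*
    grad_x = np.abs(np.diff(residual, axis=1))       # |grad_x R|, forward difference
    grad_y = np.abs(np.diff(residual, axis=0))       # |grad_y R|, forward difference
    # Assumption: M is the number of pixels in the map.
    return (grad_x.sum() + grad_y.sum()) / residual.size
```

Under these assumptions the loss is close to zero whenever the prediction is an affine function of the reference, which is the invariance the name refers to; any remaining mismatch in local depth gradients is penalized.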

Abstract

While current high-resolution depth estimation methods achieve strong results, they often suffer from computational inefficiency: they rely on heavyweight models and multiple inference steps, which increases inference time. To address this, we introduce PatchRefiner V2 (PRV2), which replaces heavy refiner models with lightweight encoders. This reduces model size and inference time but introduces noisy features. To overcome this, we propose a Coarse-to-Fine (C2F) module with a Guided Denoising Unit that refines and denoises the refiner features, and a Noisy Pretraining strategy that pretrains the refiner branch so that the lightweight refiner can be fully exploited. Additionally, we introduce a Scale-and-Shift Invariant Gradient Matching (SSIGM) loss to enhance synthetic-to-real domain transfer. PRV2 outperforms state-of-the-art depth estimation methods on UnrealStereo4K in both accuracy and speed, with fewer parameters and faster inference. It also shows improved depth boundary delineation on real-world datasets such as CityScapes, ScanNet++, and KITTI, demonstrating its versatility across domains.
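As a rough illustration of the Noisy Pretraining idea, the sketch below swaps the coarse-branch features for standard-normal noise of the same shapes during the pretraining stage, so that the refiner branch is trained without relying on real coarse guidance. This is only a sketch under assumptions: the function name, the list-of-arrays feature layout, and the boolean switch are hypothetical and not taken from the paper.

```python
import numpy as np

def coarse_features_for_refiner(coarse_feats, noisy_pretraining, rng):
    """Select the coarse features that are fed to the refinement modules.

    coarse_feats: list of arrays, one per resolution level (hypothetical layout).
    noisy_pretraining: if True, replace every level with N(0, 1) noise.
    rng: a numpy.random.Generator used to draw the noise.
    """
    if noisy_pretraining:
        # Keep each level's shape but discard its content.
        return [rng.standard_normal(f.shape) for f in coarse_feats]
    return coarse_feats
```

In a later end-to-end stage, the same call with `noisy_pretraining=False` would pass the real coarse features through unchanged.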

Paper Structure (20 sections, 5 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: UnrealStereo4K results. PatchRefiner V2 (PRV2) significantly outperforms previous high-resolution frameworks. In particular, PRV2$_{C}$ achieves a new SOTA RMSE while being 2.3x faster than PR, and PRV2$_{M}$ is 9.2x smaller and 10.7x faster than PF. PF and PR are short for PatchFusion [li2023patchfusion] and PatchRefiner [li2024patchrefiner], respectively. We present the comparison of PR and PRV2 in Fig. 2.
  • Figure 2: A comparison of (a) PatchRefiner and (b) our proposed PatchRefiner V2. We adopt a lightweight encoder for the refiner branch, which alleviates the inference speed bottleneck, reduces the number of parameters for high-resolution estimation, and facilitates end-to-end training. A novel coarse-to-fine (C2F) module is proposed to denoise features from the lite model and further boost performance.
  • Figure 3: Visualization of F2C input feature maps. We showcase the first 16 channels of the F2C input features. (c) Without the C2F module (setting ③ in Tab. \ref{tab:arch_ablation}), the refiner features are 'noisy' and hard to interpret. (d) The C2F module helps denoise the refiner features, leading to clear boundaries and better results.
  • Figure 4: Left: Coarse-to-Fine (C2F) module overview. It processes refiner features in a bottom-to-top manner with $N$ successive C2F layers. Each layer is guided by coarse features of the corresponding resolution and outputs denoised features for the Fine-to-Coarse (F2C) module. Center: C2F layers combine multi-level features with Residual Convolutional Units [lin2017refinenet, Ranftl2022midas] and denoise the features using Guided Denoising Units (GDU). Right: Guidance information from the coarse branch is introduced through a concatenation followed by a convolutional block, and is then converted to a weight map ranging from 0 to 1 by the sigmoid operator. An elementwise multiplication with this weight map then denoises the shortcut feature.
  • Figure 5: Qualitative comparison on UnrealStereo4K. We show the depth prediction and the corresponding error map. The comparisons indicate that our PRV2$_{C}$ outperforms its counterparts [bhat2023zoedepth, li2024patchrefiner], with sharper edges and lower error around boundaries while achieving faster inference. We show individual patches in all images to emphasize details near depth boundaries.
  • ...and 1 more figure
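
The Guided Denoising Unit described in the Figure 4 caption (concatenate, convolutional block, sigmoid, elementwise multiplication) can be sketched in NumPy as follows. This is a minimal sketch under assumptions: the convolutional block CB is stood in for by a single 1x1 convolution, features use a channels-first (C, H, W) layout, and the names `guided_denoising_unit`, `weight`, and `bias` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def guided_denoising_unit(f_c, f_s, weight, bias):
    """Compute f_d = M_w * f_s with M_w = sigmoid(CB(Cat(f_c, f_s))).

    f_c:    coarse feature, shape (C_c, H, W)
    f_s:    shortcut refiner feature, shape (C_s, H, W)
    weight: 1x1 convolution weights, shape (C_s, C_c + C_s)  (assumed stand-in for CB)
    bias:   1x1 convolution bias, shape (C_s,)
    """
    cat = np.concatenate([f_c, f_s], axis=0)                # Cat(f_c, f_s)
    logits = np.einsum("oc,chw->ohw", weight, cat)          # 1x1 convolution
    logits = logits + bias[:, None, None]
    m_w = sigmoid(logits)                                   # weight map in [0, 1]
    return m_w * f_s                                        # elementwise gating
```

Because the weight map lies between 0 and 1, the unit can only attenuate the shortcut feature, never amplify it; the coarse feature decides where the refiner feature is kept and where it is suppressed.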