HybridDepth: Robust Metric Depth Fusion by Leveraging Depth from Focus and Single-Image Priors

Ashkan Ganj, Hang Su, Tian Guo

TL;DR

HybridDepth tackles scale ambiguity in metric depth estimation from monocular cameras by fusing depth-from-focus (DFF) cues from a focal stack with a learned relative-depth prior. The pipeline has three stages: extract a relative-depth map from a single RGB image, align it to metric scale by least-squares fitting against a DFF-derived metric depth map, and refine per-pixel scales with an uncertainty-aware refinement network. The approach yields state-of-the-art results on focal-stack benchmarks (DDFF12, NYU Depth V2) and exhibits strong zero-shot generalization to ARKitScenes and Mobile Depth, while maintaining a compact model and fast inference suitable for mobile AR. Key contributions include a modular fusion scheme, an uncertainty-guided scale refinement, synthesis of focal-stack training data, and an end-to-end mobile pipeline with real-time inference. In practice, this enables robust metric depth on consumer devices without LiDAR/ToF sensors, with improved depth fidelity in texture-poor regions and across unseen environments.
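The global alignment stage described above can be illustrated with a minimal sketch. It assumes the alignment is an ordinary least-squares fit of a single scale and shift mapping relative depth onto the DFF metric depth; the function names (`fit_scale_shift`, `apply_scale_shift`), the flat-list pixel layout, and the optional validity mask are hypothetical choices for illustration and are not taken from the paper or its code release.

```python
from statistics import fmean


def fit_scale_shift(relative, metric, valid=None):
    """Closed-form least-squares fit of metric ~= scale * relative + shift.

    relative, metric: flat sequences of per-pixel depth values (same length).
    valid: optional flat sequence of booleans; False marks pixels to ignore.
    All names here are illustrative assumptions, not the authors' API.
    """
    if valid is None:
        valid = [True] * len(relative)
    pairs = [(r, m) for r, m, ok in zip(relative, metric, valid) if ok]
    if len(pairs) < 2:
        raise ValueError("need at least two valid pixels")

    r_mean = fmean(r for r, _ in pairs)
    m_mean = fmean(m for _, m in pairs)

    # Sums of squared deviations and cross deviations over valid pixels.
    var_r = sum((r - r_mean) ** 2 for r, _ in pairs)
    if var_r == 0.0:
        raise ValueError("relative depth is constant; scale is undetermined")
    cov_rm = sum((r - r_mean) * (m - m_mean) for r, m in pairs)

    scale = cov_rm / var_r
    shift = m_mean - scale * r_mean
    return scale, shift


def apply_scale_shift(relative, scale, shift):
    """Map every relative-depth value to the globally scaled metric estimate."""
    return [scale * r + shift for r in relative]


if __name__ == "__main__":
    # Toy data: four trusted pixels plus one unreliable DFF value that is masked out.
    relative = [0.0, 0.25, 0.5, 1.0, 0.75]
    metric = [0.5, 1.0, 1.5, 2.5, 99.0]
    valid = [True, True, True, True, False]
    scale, shift = fit_scale_shift(relative, metric, valid)
    print(scale, shift, apply_scale_shift(relative, scale, shift))
```

Because the fit uses only two global parameters, it cannot correct local errors; in the pipeline summarized above, that is the role of the per-pixel scale refinement. A real implementation would likely operate on image arrays (for example with NumPy or PyTorch) rather than Python lists.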

Abstract

We propose HybridDepth, a robust depth estimation pipeline that addresses key challenges in depth estimation, including scale ambiguity, hardware heterogeneity, and generalizability. HybridDepth leverages focal stacks, data readily accessible on common mobile devices, to produce accurate metric depth maps. By incorporating depth priors afforded by recent advances in single-image depth estimation, our model achieves a higher level of structural detail than existing methods. We test our pipeline as an end-to-end system, with a newly developed mobile client that captures focal stacks and sends them to a GPU-powered server for depth estimation. Comprehensive quantitative and qualitative analyses demonstrate that HybridDepth outperforms state-of-the-art (SOTA) models on common datasets such as DDFF12 and NYU Depth V2. HybridDepth also shows strong zero-shot generalization. When trained on NYU Depth V2, HybridDepth surpasses SOTA models in zero-shot performance on ARKitScenes and delivers more structurally accurate depth maps on Mobile Depth. The code is available at https://github.com/cake-lab/HybridDepth/.

Paper Structure (25 sections, 5 equations, 9 figures, 9 tables)

This paper contains 25 sections, 5 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: HybridDepth produces globally scaled depth maps, and refines them to further correct errors and enhance details.
  • Figure 2: An overview of HybridDepth, which consists of three stages: (1) capture a focal stack and pass the frames through two branches; (2) calculate scale and shift from the estimated relative and metric depth maps using least-squares fitting; (3) feed the globally scaled depth map and a processed version of the Metric DFF branch output to the refinement model, which outputs an updated scale map that is applied to the globally scaled depth map to produce the final depth map.
  • Figure 3: HybridDepth performance in capturing small details in depth maps in comparison to DFV on DDFF12.
  • Figure 4: HybridDepth's zero-shot performance on ARKitScenes compared to DFV and Depth Anything, demonstrating improved depth accuracy and detail preservation.
  • Figure 5: Qualitative results on Mobile Depth dataset.
  • ...and 4 more figures