MultiDepth: Multi-Sample Priors for Refining Monocular Metric Depth Estimations in Indoor Scenes

Sanghyun Byun, Jacob Song, Woo Seong Chung

TL;DR

MultiDepth refines the initial depth map predicted by a pre-trained MMDE model using a lightweight encoder-decoder refinement network that processes multiple samples of the input image, including segmentation masks.

Abstract

Monocular metric depth estimation (MMDE) is a crucial task for indoor scene reconstruction on edge devices. Despite this importance, existing models are sensitive to factors such as the boundary frequency of objects in the scene and scene complexity, and fail to fully capture many indoor scenes. In this work, we propose to close this gap through the task of monocular metric depth refinement (MMDR), leveraging state-of-the-art MMDE models. MultiDepth takes samples of the image along with the initial depth map prediction made by a pre-trained MMDE model. Unlike existing iterative depth refinement techniques, MultiDepth does not employ normal map prediction as part of its architecture, lowering the model size and computation overhead while still producing impactful changes across refinement iterations. MultiDepth implements a lightweight encoder-decoder architecture for the refinement network, processing multiple samples from the given image, including segmentation masking. We evaluate MultiDepth on four datasets and compare it to state-of-the-art methods to demonstrate its effective refinement with minimal overhead, showing accuracy improvements upward of 45%.

Paper Structure

This paper contains 19 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: UniDepth [unidepth] prediction inconsistency between pixel-unshuffled images. (d) shows the L1 loss between the ground truth and the Gaussian-blurred image for forgiveness, and (e) shows the L1 loss between the ground truth and the pixel-shuffled image of the predictions on the pixel-unshuffled original image. PUD stands for pixel-unshuffle downscaling.
  • Figure 2: UniDepth [unidepth] prediction inconsistency between sub-sampled images. (a) shows the ground-truth sub-sample. (b) shows the UniDepth prediction on a sub-sampled image. (c) shows the L1 loss between the ground truth and the depth prediction, where green indicates greater loss.
  • Figure 3: Model Architecture. The image is first processed through a pre-trained D-Net (a monocular metric depth estimation model) [unidepth, monodepth2, superprimitive] to generate an initial coarse depth map from a single RGB image. The resulting RGB-D image is then passed through three sampling pipelines: segmentation (using SAM2 [sam2]), random subsampling, and pixel-unshuffling with scale $s_{pud}$. The processed images are optionally passed through a super-resolution module for resolution matching. All processed images and the original RGB-D image are passed into R-Net, a UNet [unet] architecture with a ResNet-like [resnet] encoder and a multi-scale decoder, to generate refined depth maps at different view samples. The multiple outputs are processed in the MRCM, which aggregates them into a new depth map that can be fed back into the model for iterative refinement.
  • Figure 4: Qualitative Results Comparison. The ViT version of UniDepth [unidepth], ZoeDepth [zoedepth], and MultiDepth at 1, 5, and 10 iterations are tested on NYUv2 [nyuv2] and Diode Indoor [diode]. Point clouds are shown at the top, and depth map predictions are shown at the bottom. The point cloud projection follows Equation \ref{eq:unproject}. The predicted intrinsic matrix is used for UniDepth [unidepth] and MultiDepth, and the dataset-provided intrinsic matrix is used for ZoeDepth [zoedepth]. Best viewed on screen and zoomed in.
  • Figure 5: Qualitative Depth Refinement over Iterations. Changes to the aggregate depth map over multiple iterations are analyzed with iteration counts of 0, 1, 10, 20, and 30 on NYUv2 [nyuv2], Diode Indoor [diode], and ETH3D Indoor [eth3d] test data. Iteration 0 is equivalent to the UniDepth [unidepth] output.
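
The pixel-unshuffle downscaling (PUD) mentioned in the captions for Figures 1 and 3 can be illustrated with a minimal NumPy sketch. It assumes that PUD with scale s splits an image into s × s strided sub-images, each at 1/s of the original resolution, and that pixel-shuffling is the inverse interleaving. The function names `pixel_unshuffle` and `pixel_shuffle` and the H × W × C array layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def pixel_unshuffle(img: np.ndarray, s: int) -> list:
    """Split an (H, W, ...) array into s*s strided sub-images of shape (H/s, W/s, ...)."""
    h, w = img.shape[:2]
    if h % s != 0 or w % s != 0:
        raise ValueError("height and width must be divisible by the scale")
    # Sub-image (i, j) keeps rows i, i+s, i+2s, ... and columns j, j+s, j+2s, ...
    return [img[i::s, j::s] for i in range(s) for j in range(s)]


def pixel_shuffle(subs: list, s: int) -> np.ndarray:
    """Inverse of pixel_unshuffle: interleave s*s sub-images back into one array."""
    h, w = subs[0].shape[:2]
    out = np.empty((h * s, w * s) + subs[0].shape[2:], dtype=subs[0].dtype)
    for k, sub in enumerate(subs):
        i, j = divmod(k, s)  # same (row offset, column offset) order as pixel_unshuffle
        out[i::s, j::s] = sub
    return out


# Hypothetical 4x6 RGB-D image (4 channels) filled with distinct values.
rgbd = np.arange(4 * 6 * 4, dtype=np.float32).reshape(4, 6, 4)
subs = pixel_unshuffle(rgbd, s=2)
restored = pixel_shuffle(subs, s=2)
```

Because every sub-image covers the whole scene at a lower resolution, a depth model can be run on each one and the predictions interleaved back with `pixel_shuffle`, which is the kind of comparison Figure 1 describes.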
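
Figure 4 projects depth maps into point clouds using the paper's unprojection equation, which is not reproduced in this section. As a hedged sketch only, the code below assumes the standard pinhole back-projection, where a pixel (u, v) with metric depth d maps to ((u - c_x) d / f_x, (v - c_y) d / f_y, d). The function name `unproject` and the example intrinsic matrix are hypothetical; the paper's exact formulation may differ.

```python
import numpy as np


def unproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) metric depth map into an (H*W, 3) point cloud.

    Assumes a pinhole camera with a 3x3 intrinsic matrix K (an assumption,
    not necessarily the paper's exact equation).
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction); both are (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


# Hypothetical 2x2 depth map at a constant 2 m, with made-up intrinsics.
depth = np.full((2, 2), 2.0)
K = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
points = unproject(depth, K)
```

The same function applies whether K is predicted by the model, as the caption states for UniDepth and MultiDepth, or supplied by the dataset, as for ZoeDepth.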