SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model

Yihao Liu, Feng Xue, Anlong Ming, Mingshuai Zhao, Huadong Ma, Nicu Sebe

TL;DR

This work addresses the generalization and data-efficiency gap in monocular metric depth estimation (MMDE) by introducing SM$^4$Depth, a single model that works across indoor and outdoor scenes without scene-specific parameters. It proposes variation-based unnormalized depth bins to reduce bin ambiguity across large depth ranges, and a divide-and-conquer, domain-aware bin estimation that learns metric bins from multiple solution sub-spaces, reducing the need for massive training data. The authors also introduce BUPT Depth, an uncut RGB-D dataset for evaluating the consistency of depth accuracy across scenes. With a Swin-Transformer backbone, trained on a single RTX 3090 with about 150K RGB-D pairs, SM$^4$Depth performs strongly on most unseen datasets and maintains consistent accuracy across indoor and outdoor domains, which makes it practical for real-world MMDE tasks.

Abstract

In the last year, universal monocular metric depth estimation (universal MMDE) has gained considerable attention, serving as the foundation model for various multimedia tasks, such as video and image editing. Nonetheless, current approaches struggle to maintain consistent accuracy across diverse scenes without scene-specific parameters and pre-training, which hinders the practicality of MMDE. Furthermore, these methods rely on extensive training datasets comprising millions, if not tens of millions, of samples, leading to significant time and hardware expenses. This paper presents SM$^4$Depth, a model that works seamlessly for both indoor and outdoor scenes, without needing extensive training data or GPU clusters. Firstly, to obtain consistent depth across diverse scenes, we propose a novel metric scale modeling, i.e., variation-based unnormalized depth bins. It reduces the ambiguity of the conventional metric bins and enables better adaptation to large depth gaps between scenes during training. Secondly, we propose a "divide and conquer" solution to reduce reliance on massive training data. Instead of estimating directly from the vast solution space, the metric bins are estimated from multiple solution sub-spaces to reduce complexity. Additionally, we introduce an uncut depth dataset, BUPT Depth, to evaluate depth accuracy and consistency across various indoor and outdoor scenes. Trained on a consumer-grade GPU using just 150K RGB-D pairs, SM$^4$Depth achieves outstanding performance on most never-before-seen datasets, especially maintaining consistent accuracy across indoor and outdoor scenes. The code can be found at https://github.com/mRobotit/SM4Depth.
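To make the "conventional metric bins" mentioned in the abstract concrete, the following is a minimal sketch of the widely used AdaBins-style formulation, in which normalized bin widths are scaled by a fixed depth range and the per-pixel depth is the probability-weighted sum of bin centers. It is not SM$^4$Depth's variation-based unnormalized bins. The function names, bin widths, and probabilities are hypothetical and chosen only for illustration; the two depth ranges are borrowed from the Figure 1 caption.

```python
def bin_centers(widths, d_min, d_max):
    """Map (possibly unscaled) bin widths to metric bin centers.

    Assumption: AdaBins-style normalized bins over a fixed range
    [d_min, d_max]; this is the conventional scheme, not the paper's.
    """
    total = sum(widths)
    centers = []
    left = 0.0  # cumulative normalized width to the left of the current bin
    for w in widths:
        w_norm = w / total
        centers.append(d_min + (d_max - d_min) * (left + w_norm / 2.0))
        left += w_norm
    return centers


def pixel_depth(probs, centers):
    """Depth of one pixel: probability-weighted sum of the bin centers."""
    return sum(p * c for p, c in zip(probs, centers))


# Hypothetical values: three bins and one pixel's bin probabilities.
widths = [1.0, 1.0, 2.0]
probs = [0.25, 0.5, 0.25]

# The same normalized bins yield very different metric centers depending
# on the assumed range (a distant view vs. a close-up).
far_centers = bin_centers(widths, 0.0, 30.0)
near_centers = bin_centers(widths, 0.0, 3.5)

far_depth = pixel_depth(probs, far_centers)
near_depth = pixel_depth(probs, near_centers)
print(far_centers, far_depth)
print(near_centers, near_depth)
```

In this sketch the same normalized outputs map to metric depths that differ by almost an order of magnitude once the range changes, which illustrates why bins tied to a per-scene range become ambiguous when one model must cover both close-up and distant scenes.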

Paper Structure (22 sections, 7 equations, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Two bin-center curves on (a) a distant-view image with range $[0\,\mathrm{m}, 30\,\mathrm{m}]$ and (b) a close-up image with range $[0\,\mathrm{m}, 3.5\,\mathrm{m}]$, both from iBims-1. (c) shows the probability maps $P$ corresponding to the red curve in (b).
  • Figure 2: The heatmaps show how frequently depth values occur in each depth bin, computed on iBims-1. A darker square $(\mathcal{X},\mathcal{Y})$ indicates that the depth value $\mathcal{Y}$ mainly occurs within the $\mathcal{X}^\text{th}$ depth bin.
  • Figure 3: SM$^4$Depth pipeline, containing the domain-aware bin estimation (blue mask) and the HSC-decoder (red mask).
  • Figure 4: Top view of the scene where we collected the BUPT Depth dataset. Red lines indicate indoor scenes and purple lines indicate outdoor scenes. As examples, we give five images and their ground-truth depth calculated by CREStereo.
  • Figure 5: RMSE per frame of SM$^4$Depth (orange), Metric3D (gray), ZoeDepth-NK (blue), UniDepth (yellow), and DepthAnything-NK (purple) on BUPT Depth. We use the stereo depth of the ZED2 (first chart) and of CREStereo (second chart) as ground truth, respectively. Gray background indicates outdoor frames, and white indicates indoor frames.
  • ...and 2 more figures
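
The per-frame RMSE plotted in Figure 5 can be sketched as follows. This is a minimal illustration of the standard root-mean-square error over pixels with valid ground truth, not the authors' evaluation script. The function name `frame_rmse`, the convention that a ground-truth value of 0 marks an invalid pixel, and the sample values are all assumptions.

```python
import math


def frame_rmse(pred, gt):
    """RMSE between predicted and ground-truth depth for one frame.

    pred, gt: flat sequences of depth values in meters.
    Assumption: gt <= 0 marks pixels without valid ground truth.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        return float("nan")  # no valid pixel in this frame
    mse = sum((p - g) ** 2 for p, g in valid) / len(valid)
    return math.sqrt(mse)


# Hypothetical two-frame sequence; the third pixel of frame 0 has no ground truth.
frames = [
    ([1.0, 2.0, 5.0, 4.0], [1.0, 4.0, 0.0, 2.0]),
    ([3.0, 3.0], [3.0, 3.0]),
]
per_frame = [frame_rmse(pred, gt) for pred, gt in frames]
print(per_frame)
```

Reporting one value per frame, rather than a single dataset average, is what allows a plot like Figure 5 to show whether the error stays stable when an uncut sequence moves between indoor and outdoor segments.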