TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast

Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren

Abstract

This work presents a generalizable framework for transferring relative depth to metric depth. Current monocular depth estimation methods fall mainly into two categories: monocular metric depth estimation (MMDE) and monocular relative depth estimation (MRDE). MMDE methods estimate depth at metric scale but are often limited to a specific domain. MRDE methods generalize well across domains, but their scale is uncertain, which hinders downstream applications. We therefore aim to build a framework that resolves this scale uncertainty and transfers relative depth to metric depth. Previous methods took language as input and estimated two factors for rescaling. Our approach, TR2M, takes both a text description and an image as input and estimates two rescale maps that transfer relative depth to metric depth at the pixel level. Features from the two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop scale-oriented contrastive learning, which uses the depth distribution as guidance to push the model to learn intrinsic knowledge aligned with the scale distribution. TR2M uses only a small number of trainable parameters and is trained on datasets from various domains. Experiments demonstrate TR2M's strong performance on seen datasets and reveal superior zero-shot capabilities on five unseen datasets. These results show the considerable potential of pixel-wise transfer from relative depth to metric depth with language assistance. (Code is available at: https://github.com/BeileiCui/TR2M)
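
To make the idea of pixel-level rescaling concrete, here is a minimal sketch. It assumes the two rescale maps act as a per-pixel scale map and a per-pixel shift map applied affinely to the relative depth. The abstract does not state this form, so the affine assumption, the function name `rescale_to_metric`, and the toy values are all hypothetical and are not taken from the TR2M code.

```python
from typing import List

DepthMap = List[List[float]]


def rescale_to_metric(relative: DepthMap, scale: DepthMap, shift: DepthMap) -> DepthMap:
    """Per-pixel rescaling: metric = scale * relative + shift.

    ASSUMPTION: the affine (scale, shift) form is illustrative only;
    the abstract says "two rescale maps" without fixing how they are applied.
    """
    if not (len(relative) == len(scale) == len(shift)):
        raise ValueError("all maps must have the same number of rows")
    metric: DepthMap = []
    for rel_row, scale_row, shift_row in zip(relative, scale, shift):
        if not (len(rel_row) == len(scale_row) == len(shift_row)):
            raise ValueError("all maps must have the same number of columns")
        # Each pixel gets its own scale and shift, unlike a global rescaling.
        metric.append([a * r + b for r, a, b in zip(rel_row, scale_row, shift_row)])
    return metric


# Toy 2x2 example with made-up values.
relative = [[0.0, 0.5], [1.0, 0.25]]
scale = [[2.0, 2.0], [4.0, 4.0]]
shift = [[1.0, 1.0], [0.5, 0.5]]
metric = rescale_to_metric(relative, scale, shift)
```

Under the same assumption, rescaling with two global factors corresponds to the special case in which both maps are constant across the image.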

Paper Structure

This paper contains 23 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of motivation. (a) Metric depth estimation is typically restricted to a single domain or relies on camera parameters or sensors to improve generalization. (b) Relative depth estimation generalizes better but is ambiguous in scale. (c) We therefore seek to transfer generalizable relative depth to metric depth through pixel-wise rescaling maps, given an image and a readily accessible text description; this yields metric depth for various domains with one lightweight trainable architecture. (d) Qualitative examples.
  • Figure 2: Illustration of the proposed TR2M framework. Image and text embedding features are first obtained with separate frozen encoders, and a cross-modality attention module is proposed to integrate them. The rescale maps are predicted with different decoders to transfer relative depth to metric depth. Ground truth depth map and aligned pseudo metric depth map are utilized. Scale-Oriented Contrast is applied to the final embedding features.
  • Figure 3: Illustration of the proposed Dual-Level Scale-Oriented Contrastive Learning. The scale-oriented contrast makes the embedding features more consistent with the scale and depth distribution at different levels, thus enhancing scale perception capability.
  • Figure 4: Qualitative results on NYUv2. Our method consistently produces better predictions with much less error. $\Delta$ denotes the Abs Rel error, ranging from lowest (black) to highest (red).
  • Figure 5: The values of the relative depth map within the red rectangles are inconsistent with the ground truth. Our method can correct such errors when transferring to metric depth.
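
The Abs Rel error shown in Figure 4 is commonly defined as the absolute relative error $|d_{pred} - d_{gt}| / d_{gt}$, averaged over pixels with valid ground truth. The sketch below computes both the per-pixel error map and its mean. The function names, the rule of treating non-positive ground truth as invalid, and the toy values are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Optional

DepthMap = List[List[float]]


def abs_rel_map(pred: DepthMap, gt: DepthMap) -> List[List[Optional[float]]]:
    """Per-pixel absolute relative error; None marks pixels without valid ground truth."""
    # ASSUMPTION: ground-truth values <= 0 are treated as missing measurements.
    return [
        [abs(p - g) / g if g > 0 else None for p, g in zip(pred_row, gt_row)]
        for pred_row, gt_row in zip(pred, gt)
    ]


def abs_rel(pred: DepthMap, gt: DepthMap) -> float:
    """Mean absolute relative error over valid pixels."""
    errors = [e for row in abs_rel_map(pred, gt) for e in row if e is not None]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    return sum(errors) / len(errors)


# Toy 2x2 example with made-up depths in metres; the last pixel has no ground truth.
pred = [[2.0, 3.0], [5.0, 1.0]]
gt = [[2.0, 4.0], [4.0, 0.0]]
error_map = abs_rel_map(pred, gt)
mean_error = abs_rel(pred, gt)
```

A colour-coded visualization like the one in Figure 4 could then map each entry of `error_map` to a colour between the lowest and highest error.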