WorDepth: Variational Language Prior for Monocular Depth Estimation

Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong

TL;DR

This work addresses the scale ambiguity in monocular depth estimation by using language as a prior. It presents WorDepth, a variational framework with a text-VAE that maps captions to a distribution over plausible scene layouts and an image-conditioned sampler that grounds depth in the observed image; the two are trained via alternating optimization. Using CLIP for text features and a Swin-L-based sampler, it reports state-of-the-art results on NYU Depth V2 and KITTI and demonstrates zero-shot transfer to SUN-RGBD. The approach offers a principled way to fuse semantic language cues with visual evidence to produce metric depth maps from a single image.

Abstract

Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities, i.e., scale. Predicting a 3D scene from text description(s) is similarly ill-posed, i.e., in the spatial arrangement of the objects described. We investigate whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this, we focus on monocular depth estimation, the problem of predicting a dense depth map from a single image, but with an additional text caption describing the scene. To this end, we begin by encoding the text caption as a mean and standard deviation; using a variational framework, we learn the distribution of the plausible metric reconstructions of 3D scenes corresponding to the text captions as a prior. To "select" a specific reconstruction or depth map, we encode the given image through a conditional sampler that samples from the latent space of the variational text encoder, which is then decoded to the output depth map. Our approach is trained by alternating between the text and image branches: in one optimization step, we predict the mean and standard deviation from the text description and sample from a standard Gaussian, and in the other, we sample using an image-conditioned sampler. Once trained, we directly predict depth from the encoded text using the conditional sampler. We demonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where we show that language can consistently improve performance in both.
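
The two sampling paths described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear heads (W_mu, W_log_sigma, W_eps), all dimensions, the random stand-in features, and the strict even/odd alternation between text and image steps are assumptions made for illustration, and the depth decoder and losses are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
TEXT_DIM, IMG_DIM, LATENT_DIM, NUM_PATCHES = 16, 12, 8, 4

# Hypothetical linear heads standing in for the learned networks.
W_mu = rng.standard_normal((TEXT_DIM, LATENT_DIM)) * 0.1
W_log_sigma = rng.standard_normal((TEXT_DIM, LATENT_DIM)) * 0.1
W_eps = rng.standard_normal((IMG_DIM, LATENT_DIM)) * 0.1


def encode_text(text_feat):
    """Map a caption feature to the mean and standard deviation of the latent."""
    mu = text_feat @ W_mu
    sigma = np.exp(text_feat @ W_log_sigma)  # exp keeps sigma positive
    return mu, sigma


def reparameterize(mu, sigma, eps):
    """Reparameterization: z = mu + sigma * eps."""
    return mu + sigma * eps


def conditional_sampler(patch_feats):
    """Predict one eps per image patch instead of drawing it at random."""
    return patch_feats @ W_eps


# Random stand-ins for a text embedding and per-patch image features.
text_feat = rng.standard_normal(TEXT_DIM)
patch_feats = rng.standard_normal((NUM_PATCHES, IMG_DIM))

mu, sigma = encode_text(text_feat)

latents = []
for step in range(4):
    if step % 2 == 0:
        # Text step: eps is drawn from a standard Gaussian.
        eps = rng.standard_normal(LATENT_DIM)
    else:
        # Image step: eps is predicted from the image, one vector per patch.
        eps = conditional_sampler(patch_feats)
    # A depth decoder (not shown) would map z to a depth map for the loss.
    latents.append(reparameterize(mu, sigma, eps))
```

In both steps the latent is formed the same way from the text-derived mean and standard deviation; only the source of the noise term changes, which is how the image "selects" one reconstruction from the text prior.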

Paper Structure (13 sections, 4 equations, 6 figures, 6 tables)

This paper contains 13 sections, 4 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Language as a prior for depth estimation. Depth estimation from a single image is an ill-posed problem (i.e., scale), and likewise from text captions (i.e., layout). Can two inherently ambiguous modalities resolve metric-scaled depth estimates?
  • Figure 2: Training WorDepth. We begin by optimizing the text-VAE, which predicts the mean and standard deviation of the latent distribution of depth maps corresponding to a text caption. We then sample $\hat{z}$ from the distribution using the reparameterization trick with $\epsilon \sim \mathcal{N}(0, 1)$ and decode it into a depth map for loss computation. We then optimize a conditional sampler, which predicts patch-wise $\tilde{\epsilon}$ from an image to sample $\tilde{z}$ from the latent distribution; $\tilde{z}$ is decoded into the output depth for loss computation. The depth decoder is updated in both alternating steps.
  • Figure 3: Qualitative results on NYU Depth V2. We compare WorDepth with AdaBins [bhat2021adabins]. Text descriptions are generated using ExpansionNet v2 [ExpansionNet_v2]. Overall, WorDepth improves uniformly across the image (darker in error map), implying better scale. WorDepth also predicts more accurate depth in regions corresponding to "chairs", "window", "shower curtain", "man", and "desks", which are objects specified by text descriptions. Note: Zeros in the ground truth depth map indicate the absence of valid depth values.
  • Figure 4: Visualization of depth estimations on KITTI. Compared with AdaBins [bhat2021adabins], WorDepth improves uniformly across the image (darker in error map), implying better scale. WorDepth also predicts more accurate depth in regions corresponding to "wall", "trees", and "building", which are objects specified by text descriptions. Note: Zeros in the ground truth depth map indicate the absence of valid depth values.
  • Figure 5: Additional visualization of monocular depth estimation on NYU Depth V2.
  • ...and 1 more figure
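
The captions of Figures 3 and 4 refer to error maps and note that zeros in the ground truth mark pixels without valid depth. The following is a minimal sketch of how such a masked error map might be computed. The choice of absolute relative error and the function names are assumptions; the captions do not state which error measure is plotted.

```python
import numpy as np


def abs_rel_error_map(pred, gt):
    """Per-pixel |pred - gt| / gt, computed only where ground truth is valid.

    Pixels with gt == 0 carry no valid depth, so their error is left at 0
    and they are excluded through the returned mask.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0
    err = np.zeros_like(pred)
    err[valid] = np.abs(pred[valid] - gt[valid]) / gt[valid]
    return err, valid


def mean_abs_rel(pred, gt):
    """Average the error map over valid pixels only."""
    err, valid = abs_rel_error_map(pred, gt)
    return err[valid].mean()


# Toy ground truth in meters; the zero marks a pixel with no measurement.
gt = np.array([[2.0, 0.0], [4.0, 1.0]])

# A prediction that is wrong only by a global scale factor of 1.1.
scaled_pred = 1.1 * gt
scaled_err, valid = abs_rel_error_map(scaled_pred, gt)
```

With a pure scale error, every valid pixel in `scaled_err` has the same relative error, which is consistent with the captions reading a uniform change across the error map as a change in scale.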