ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation

Suraj Patni, Aradhye Agarwal, Chetan Arora

TL;DR

The paper tackles monocular depth estimation by addressing the lack of parallax cues through semantic conditioning derived from pre-trained Vision Transformer embeddings. It introduces ECoDepth, a diffusion-based model that uses a novel CIDE module to convert ViT information into a 768-dimensional conditioning vector for a latent-diffusion backbone, enabling dense depth prediction from a single image. The approach achieves state-of-the-art results on NYU Depth v2 and competitive performance on KITTI, while delivering strong zero-shot transfer to unseen indoor datasets without multi-dataset pretraining. This work demonstrates that rich, transformer-based semantic context can substantially improve SIDE accuracy and generalization, with practical implications for robust depth perception in diverse environments.

Abstract

In the absence of parallax cues, a learning-based single image depth estimation (SIDE) model relies heavily on shading and contextual cues in the image. While this simplicity is attractive, it is necessary to train such models on large and varied datasets, which are difficult to capture. It has been shown that using embeddings from pre-trained foundational models, such as CLIP, improves zero-shot transfer in several applications. Taking inspiration from this, in our paper we explore the use of global image priors generated from a pre-trained ViT model to provide more detailed contextual information. We argue that the embedding vector from a ViT model, pre-trained on a large dataset, captures more relevant information for SIDE than the usual route of generating pseudo image captions, followed by CLIP-based text embeddings. Based on this idea, we propose a new SIDE model using a diffusion backbone which is conditioned on ViT embeddings. Our proposed design establishes a new state-of-the-art (SOTA) for SIDE on the NYUv2 dataset, achieving an Abs Rel error of 0.059 (14% improvement) compared to 0.069 by the current SOTA (VPD). On the KITTI dataset, it achieves a Sq Rel error of 0.139 (2% improvement) compared to 0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model trained on NYUv2, we report mean relative improvements of (20%, 23%, 81%, 25%) over NeWCRFs on the (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%, 18%, 45%, 9%) by ZoeDepth. The project page is available at https://ecodepth-iitd.github.io
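
The Abs Rel and Sq Rel numbers quoted in the abstract, and the percentage improvements derived from them, can be reproduced with a few lines of arithmetic. The sketch below assumes the definitions commonly used in the depth estimation literature (mean of |pred - gt| / gt and mean of (pred - gt)^2 / gt over pixels with valid ground truth). The function names are illustrative, and the paper's exact evaluation protocol (depth caps, crops, masking) is not shown here.

```python
def _valid_pairs(pred, gt):
    # Keep only pixels with positive ground-truth depth (assumed validity rule).
    return [(p, g) for p, g in zip(pred, gt) if g > 0]


def abs_rel(pred, gt):
    """Absolute relative error: mean of |pred - gt| / gt over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def sq_rel(pred, gt):
    """Squared relative error: mean of (pred - gt)^2 / gt over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum((p - g) ** 2 / g for p, g in pairs) / len(pairs)


def relative_improvement(baseline, ours):
    """Fractional reduction in error relative to the baseline (lower error is better)."""
    return (baseline - ours) / baseline


# Values taken from the abstract.
nyu_gain = relative_improvement(0.069, 0.059)    # Abs Rel, VPD vs. ECoDepth
kitti_gain = relative_improvement(0.142, 0.139)  # Sq Rel, GEDepth vs. ECoDepth
print(f"NYUv2 Abs Rel improvement: {100 * nyu_gain:.1f}%")
print(f"KITTI Sq Rel improvement: {100 * kitti_gain:.1f}%")
```

The two gains come out to roughly 14.5% and 2.1%, which the abstract reports as 14% and 2%.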

Paper Structure (19 sections, 5 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Qualitative results across four different datasets, demonstrating the zero-shot performance of our model trained only on the NYU Depth v2 dataset. Corresponding quantitative results are presented in Table \ref{tab:zero-shot-indoors}. The first column displays RGB images, the second column depicts ground truth depth, and the third column showcases our model's predicted depths. Additional images for each dataset are available in the Supplementary Material.
  • Figure 2: An overview of our proposed model: The latent representation of the input image undergoes a diffusion process, which is conditioned by our proposed CIDE module. Within the CIDE module, the input image is fed through the frozen ViT model. From this, a linear combination of the learnt embeddings is computed, which is transformed to generate a 768-dimensional contextual embedding. This embedding is utilized to condition the diffusion backbone. Subsequently, hierarchical feature maps are extracted from the UNet's decoder which are concatenated and processed through a depth regressor to generate the depth map.
  • Figure 3: (a) Probabilistic graphical model corresponding to VPD. (b) The same corresponding to our formulation. Here, $\mathbf{\mathcal{C}}$ represents the semantic embedding derived from our CIDE module. This embedding is internally generated by passing $\mathbf{x}$ through the ViT, resulting in $\mathcal{E}$. Subsequently, $\mathcal{E}$ undergoes further processing to yield $\mathbf{\mathcal{C}}$, which is then utilized in the conditional diffusion module implementing $\mathbb{P}(\mathbf{z}_0 \mid \mathbf{z}_t, \mathbf{\mathcal{C}})$. The output of the conditional diffusion module is fed into the Depth Regressor module within our architecture, implementing $\mathbb{P}(\mathbf{y} \mid \mathbf{z}_0)$.
  • Figure 4: Visual Comparison on NYU Depth v2 Indoor Dataset. Note our method's ability to delineate objects in terms of their depth, such as the table lamp in Row 5, even when such information is absent from the ground truth depth map.
  • Figure 5: Visual Comparison on KITTI Outdoor Dataset.
  • ...and 7 more figures
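
The Figure 2 caption describes the CIDE module as taking the output of a frozen ViT, forming a linear combination of learnt embeddings, and transforming the result into a 768-dimensional conditioning vector. A minimal sketch of that data flow follows. Everything beyond the caption is an assumption made for illustration: the class name `CIDESketch`, the softmax used to produce mixing weights, the single linear maps, the small dimensions, and the placeholder vector standing in for the frozen ViT output. It is not the authors' implementation.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    # Multiply a (rows x cols) matrix, stored as a list of rows, by a vector.
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class CIDESketch:
    """Hypothetical stand-in for the CIDE module (randomly initialised, untrained)."""

    def __init__(self, vit_dim, num_embeddings, embed_dim, out_dim=768, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.embed_dim = embed_dim
        self.to_scores = init(num_embeddings, vit_dim)     # ViT output -> mixing scores
        self.embeddings = init(num_embeddings, embed_dim)  # table of learnt embeddings
        self.project = init(out_dim, embed_dim)            # final transform to out_dim

    def mixing_weights(self, vit_output):
        # Assumption: weights are a softmax over linearly projected ViT outputs.
        return softmax(matvec(self.to_scores, vit_output))

    def __call__(self, vit_output):
        weights = self.mixing_weights(vit_output)
        # Linear combination of the learnt embeddings.
        mixed = [
            sum(w * emb[j] for w, emb in zip(weights, self.embeddings))
            for j in range(self.embed_dim)
        ]
        # Transform to the 768-dimensional conditioning vector.
        return matvec(self.project, mixed)


# Placeholder for the frozen ViT output; a real pipeline would run the image through the ViT.
fake_vit_output = [0.1 * i for i in range(16)]
cide = CIDESketch(vit_dim=16, num_embeddings=4, embed_dim=8)
conditioning = cide(fake_vit_output)
print(len(conditioning))
```

In the described architecture this vector would then condition the diffusion UNet, whose decoder feature maps are concatenated and passed to the depth regressor; those stages are omitted here.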