Iris: Integrating Language into Diffusion-based Monocular Depth Estimation

Ziyao Zeng, Jingcheng Ni, Daniel Wang, Patrick Rim, Younjoon Chung, Fengyu Yang, Byung-Woo Hong, Alex Wong

TL;DR

This work tackles the inherent visual ambiguity of monocular depth estimation with language-conditioned diffusion. Iris supplies text descriptions as an additional condition, exploiting the conditional distribution learned during text-to-image diffusion pre-training to align depth predictions with plausible 3D scene structures. The approach is applied to three diffusion-based depth estimators (Marigold, Lotus, E2E-FT), trained on HyperSim and Virtual KITTI, and evaluated zero-shot on five real-world datasets (NYUv2, KITTI, ETH3D, ScanNet, DIODE). It improves accuracy, especially in small or text-described regions, and speeds up convergence of both training and the inference diffusion trajectory. The results suggest that language priors are useful for depth estimation in the wild, with implications for more reliable depth perception in embodied AI. The authors state that code and generated text data will be released upon acceptance.

Abstract

Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We demonstrate that language can enhance monocular depth estimation by providing an additional condition (rather than images alone) aligned with plausible 3D scenes, thereby reducing the solution space for depth estimation. This conditional distribution is learned during the text-to-image pre-training of diffusion models. To generate images under various viewpoints and layouts that precisely reflect textual descriptions, the model implicitly models object sizes, shapes, and scales, their spatial relationships, and the overall scene structure. In this paper, Iris, we investigate the benefits of our strategy to integrate text descriptions into training and inference of diffusion-based depth estimation models. We experiment with three different diffusion-based monocular depth estimators (Marigold, Lotus, and E2E-FT) and their variants. By training on HyperSim and Virtual KITTI, and evaluating on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, we find that our strategy improves the overall monocular depth estimation accuracy, especially in small areas. It also improves the model's depth perception of specific regions described in the text. We find that by providing more details in the text, the depth prediction can be iteratively refined. Simultaneously, we find that language can act as a constraint to accelerate the convergence of both training and the inference diffusion trajectory. Code and generated text data will be released upon acceptance.

Paper Structure

This paper contains 5 sections, 10 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Integrating language into diffusion models enhances monocular depth estimation by providing an additional condition (rather than images alone) associated with plausible 3D scenes, thus reducing the solution space for depth estimation. This conditional distribution is first learned during the text-to-image pre-training of diffusion models: to generate images under different viewpoints and layouts that accurately reflect the text, the model must implicitly model the size and shape of the specified objects, their spatial relationships, and the structure of the scene. During fine-tuning with image-text-depth pairs, the conditional distribution is then associated with plausible 3D scenes.
  • Figure 2: Language improves the depth perception of specified insignificant (and potentially ambiguous) regions.
  • Figure 3: Pipeline to integrate text. We train the diffusion model to predict the noise added to the noisy depth latent $\mathbf{z}_t$ at time step $t$, conditioned on $\mathbf{z}_t$, the input image $x$, and the corresponding language description $c$. During inference, the diffusion model predicts the noise in $\mathbf{z}_t$ at each time step and gradually denoises it from $\mathbf{z}_T$ (pure Gaussian noise) to $\mathbf{z}_0$ (clean depth latent). $\mathbf{z}_0$ is then decoded into the depth prediction by a frozen variational decoder.
  • Figure 4: Visualization on NYUv2. Compared to the Marigold baseline, integrating language demonstrates more accurate depth prediction for a given input image, particularly for instances specified in the language description (marked in red). This is achieved by providing additional language conditions for the semantic and geometric characteristics of specified objects. It's particularly beneficial for ambiguous or insignificant areas that are easily neglected by visual signals, like "a soap dispenser" in the first row, and "two black lamps with circular bases" in the second row.
  • Figure 5: Visualization on KITTI. Integrating language allows predicting better depth for described objects, even when parts of the object are almost invisible in the image (such as the parked car in the first column). Additional semantic and geometrical conditions are provided for the described ambiguous and insignificant regions, such as a sign at a distance, potentially enhancing the safety of self-driving systems that rely solely on vision sensors.
  • ...and 6 more figures
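
The training and inference procedure described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear beta schedule, the deterministic DDIM-style update, and every function name below are hypothetical choices made for this example. The denoiser is left abstract as any callable taking the noisy depth latent, an image latent, a text embedding, and the time step; how a real network consumes the image and the text is not specified here. The oracle denoiser is a toy stand-in that ignores both conditions and exists only so the plumbing can be checked.

```python
import numpy as np


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The schedule itself is an assumption made for this sketch.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(z0, eps, alpha_bar_t):
    """Forward process: z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps


def noise_prediction_loss(denoiser, z0, image_latent, text_emb, t, alpha_bars, rng):
    """Training target: MSE between the added noise and the predicted noise."""
    eps = rng.standard_normal(z0.shape)
    z_t = add_noise(z0, eps, alpha_bars[t])
    eps_hat = denoiser(z_t, image_latent, text_emb, t)
    return float(np.mean((eps_hat - eps) ** 2))


def ddim_sample(denoiser, image_latent, text_emb, alpha_bars, shape, rng):
    """Denoise from z_T (pure Gaussian noise) down to z_0, one step at a time.

    Uses a deterministic DDIM-style update (an assumed sampler).
    """
    z = rng.standard_normal(shape)
    for t in range(len(alpha_bars) - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps_hat = denoiser(z, image_latent, text_emb, t)
        # Estimate the clean latent implied by the current noise prediction.
        z0_hat = (z - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
        # Re-noise that estimate to the previous (less noisy) time step.
        z = np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return z


def make_oracle_denoiser(z0_true, alpha_bars):
    """Toy stand-in for a trained network: recovers the noise exactly.

    It ignores the image latent and the text embedding, so it only checks
    the loop above; it says nothing about the benefit of language.
    """
    def denoiser(z_t, image_latent, text_emb, t):
        a_t = alpha_bars[t]
        return (z_t - np.sqrt(a_t) * z0_true) / np.sqrt(1.0 - a_t)
    return denoiser


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bars = make_alpha_bars(1000)
    z0 = rng.standard_normal((4, 8, 8))  # stand-in for a depth latent
    oracle = make_oracle_denoiser(z0, alpha_bars)
    print(noise_prediction_loss(oracle, z0, None, None, 500, alpha_bars, rng))
    z0_rec = ddim_sample(oracle, None, None, alpha_bars, z0.shape, rng)
    print(np.allclose(z0_rec, z0, atol=1e-6))
```

In an actual pipeline the latent returned by the sampling loop would be passed to the frozen decoder mentioned in the caption, and the denoiser would be a trained network that uses both the image and the text description rather than an oracle.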