Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
Ziyao Zeng, Jingcheng Ni, Daniel Wang, Patrick Rim, Younjoon Chung, Fengyu Yang, Byung-Woo Hong, Alex Wong
TL;DR
This work tackles the inherent visual ambiguity of monocular depth estimation through language-conditioned diffusion. Iris supplies text descriptions to the depth estimator as an additional conditioning input, exploiting the conditional distribution learned during text-to-image diffusion pre-training to align depth predictions with plausible 3D scene structures. The approach is evaluated on three diffusion-based depth estimators (Marigold, Lotus, E2E-FT) trained on HyperSim and Virtual KITTI and tested zero-shot on five real-world datasets (NYUv2, KITTI, ETH3D, ScanNet, DIODE). It improves accuracy, especially in small or text-described regions, and converges faster during both training and inference. The results demonstrate the practical value of language priors for depth estimation in the wild, with implications for more reliable depth perception in embodied AI. The authors state that code and generated text data will be released upon acceptance.
Abstract
Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We demonstrate that language can enhance monocular depth estimation by providing an additional condition, beyond the image alone, that is aligned with plausible 3D scenes, thereby reducing the solution space for depth estimation. This conditional distribution is learned during the text-to-image pre-training of diffusion models. To generate images under various viewpoints and layouts that precisely reflect textual descriptions, the model implicitly models object sizes, shapes, and scales, their spatial relationships, and the overall scene structure. In this paper, we present Iris and investigate the benefits of integrating text descriptions into both the training and inference of diffusion-based depth estimation models. We experiment with three different diffusion-based monocular depth estimators (Marigold, Lotus, and E2E-FT) and their variants. By training on HyperSim and Virtual KITTI, and evaluating on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, we find that our strategy improves overall monocular depth estimation accuracy, especially in small areas. It also improves the model's depth perception of the specific regions described in the text. We find that providing more details in the text allows the depth prediction to be iteratively refined. We also find that language can act as a constraint that accelerates the convergence of both training and the diffusion trajectory at inference. Code and generated text data will be released upon acceptance.
