Vision-Language Embodiment for Monocular Depth Estimation

Jinchang Zhang, Guoyu Lu

TL;DR

This work tackles the ill-posed problem of monocular depth estimation by embedding the camera model and leveraging vision-language priors. It introduces Embodied Scene Depth, computed from camera intrinsics and real-time road geometry, and fuses it with RGB features; a Depth-Guided Text Variational Auto-Encoder uses textual priors to constrain plausible scene layouts via a latent distribution sampled as $\tilde{z} = \text{mean} + \text{noise} \times \text{std}$ and decoded to depth. A cross-attention-based fusion and an image-conditioned conditional sampler integrate embodied depth with visual cues, while textual descriptions provide scale and semantic guidance, enabling better depth estimation under ambiguity. The method achieves state-of-the-art or competitive results on KITTI and DDAD, with KITTI RMSE improving to $1.654$ and DDAD to $8.673$, demonstrating dense, geometry-consistent depth without extra hardware and real-time adaptability in dynamic environments.
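
The latent sampling step described above (mean plus noise times standard deviation) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `sample_latent`, the list-based latent vector, and all numeric values are assumptions made for the example. The optional `eps` argument mimics a sampler that predicts the noise instead of drawing it at random.

```python
import random


def sample_latent(mean, std, eps=None, rng=None):
    """Return z = mean + eps * std, computed elementwise.

    If eps is None, draw eps ~ N(0, 1) (the usual reparameterization trick).
    Passing eps explicitly stands in for a sampler that predicts the noise.
    """
    if len(mean) != len(std):
        raise ValueError("mean and std must have the same length")
    if eps is None:
        rng = rng or random.Random()
        eps = [rng.gauss(0.0, 1.0) for _ in mean]
    elif len(eps) != len(mean):
        raise ValueError("eps must have the same length as mean")
    return [m + e * s for m, e, s in zip(mean, eps, std)]


# Hypothetical 3-dimensional latent; the values are made up for illustration.
mean = [0.5, -1.0, 2.0]
std = [0.1, 0.2, 0.0]

z_random = sample_latent(mean, std, rng=random.Random(0))      # eps drawn from N(0, 1)
z_predicted = sample_latent(mean, std, eps=[1.0, -1.0, 3.0])   # eps supplied by the caller
```

Because the randomness is isolated in `eps`, the output stays a deterministic function of `mean` and `std`, which is what lets gradients reach the encoder in a differentiable framework.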

Abstract

Depth estimation is a core problem in robotic perception and vision tasks, but 3D reconstruction from a single image presents inherent uncertainties. Current depth estimation models primarily rely on inter-image relationships for supervised training, often overlooking the intrinsic information provided by the camera itself. We propose a method that embodies the camera model and its physical characteristics into a deep learning model, computing embodied scene depth through real-time interactions with road environments. The model can calculate embodied scene depth in real-time based on immediate environmental changes using only the intrinsic properties of the camera, without any additional equipment. By combining embodied scene depth with RGB image features, the model gains a comprehensive perspective on both geometric and visual details. Additionally, we incorporate text descriptions containing environmental content and depth information as priors for scene understanding, enriching the model's perception of objects. This integration of image and language - two inherently ambiguous modalities - leverages their complementary strengths for monocular depth estimation. The real-time nature of the embodied language and depth prior model ensures that the model can continuously adjust its perception and behavior in dynamic environments. Experimental results show that the embodied depth estimation method enhances model performance across different scenes.
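
To make the idea of depth derived from the camera model concrete, here is a minimal sketch of ground-plane depth under a pinhole camera. It rests on assumptions the abstract does not state: a flat road, an optical axis parallel to the ground, and a known mounting height `cam_height`. The function names and the numeric values are hypothetical, and the paper's actual formulation may differ.

```python
def ground_depth(v, fy, cy, cam_height):
    """Depth (metres along the optical axis) of the ground point seen at image row v.

    Assumes a level pinhole camera above a flat ground plane. A ray through
    row v has slope (v - cy) / fy below the optical axis, so it meets the
    ground at depth z = fy * cam_height / (v - cy).
    Rows at or above the horizon (v <= cy) never hit the ground: return None.
    """
    dv = v - cy
    if dv <= 0:
        return None
    return fy * cam_height / dv


def ground_depth_column(image_height, fy, cy, cam_height):
    """Ground depth for every row of one image column (None above the horizon)."""
    return [ground_depth(v, fy, cy, cam_height) for v in range(image_height)]


# Hypothetical intrinsics and mounting height, chosen only for illustration.
fy, cy, cam_height = 700.0, 100.0, 1.5
column = ground_depth_column(300, fy, cy, cam_height)
near = column[299]   # bottom row: closest ground point
far = column[110]    # row just below the horizon: far ground point
```

In this sketch, depth grows without bound as the row approaches the horizon, so small errors in `cy` or in the camera pitch matter most for distant road pixels.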

Paper Structure

This paper contains 18 sections, 14 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of the framework. We utilize a plug-and-play pre-trained image segmentation model to obtain segmentation results from images and incorporate the camera model to calculate embodied scene depth. We extract the textual semantic description of the image and derive object depth descriptions based on semantic segmentation and embodied scene depth, merging them into a textual description. The text encoder is used to predict the mean and standard deviation of the latent distribution corresponding to the depth map of the textual description. We then sample $\hat{z}$ from the distribution using the reparameterization trick, where $\epsilon \sim N(0,1)$, and decode it into a depth map for loss computation. In the feature fusion module, we extract features from the embodied scene depth and RGB image and use a cross-attention mechanism for feature fusion. Next, we optimize a conditional sampler by predicting patch-wise $\tilde{\epsilon}$ from the fused features to sample $\tilde{z}$ from the latent space, and output the depth through the depth decoder. The text and image depth decoders share weights, which are updated in each of the two alternating steps.
  • Figure 2: Embodied Depth Perception on KITTI: (a) Semantic segmented image; (b) RGB image; (c) Embodied Surface Depth; (d) Road segmented from semantic segmented image; (e) Embodied Road Depth; (f) Ground segmented from semantic segmented image; (g) Embodied Ground Depth; (h) Extended Embodied Ground Depth; (i) Embodied Scene Depth.
  • Figure 3: Error distribution of Embodied Depth: (a) Embodied Surface Depth and error distribution; (b) Embodied Road Depth and error distribution; (c) Embodied Ground Depth and error distribution; (d) Extended Embodied Ground Depth and error distribution; (e) Embodied Scene Depth and error distribution; (f) Sparse LiDAR depth as Ground Truth.
  • Figure 4: Visual results on KITTI [geiger2013vision]. From top to bottom, the models are ECoDepth [Patni_2024_CVPR], MIM [xie2023revealing], AFNet [cheng2024adaptive], and ours.
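
The cross-attention fusion mentioned in the Figure 1 caption can be illustrated with a single-head, projection-free sketch in plain Python. Several choices here are assumptions rather than details from the paper: RGB patch features act as queries, embodied-depth patch features act as keys and values, there are no learned projection matrices, and the feature values are invented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    Each query attends over all keys; the output for a query is the
    attention-weighted average of the value vectors.
    """
    dim = len(keys[0])
    out_dim = len(values[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)])
    return fused


# Hypothetical features: 2 RGB patches (queries), 3 embodied-depth patches (keys/values).
rgb_feats = [[1.0, 0.0], [0.0, 1.0]]
depth_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
depth_vals = [[5.0], [20.0], [10.0]]

fused = cross_attention(rgb_feats, depth_keys, depth_vals)  # one fused vector per RGB patch
```

Each fused vector is a convex combination of the depth values, so it always lies between the smallest and largest value it attends over. A full model would add learned query, key, and value projections and multiple heads.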