Vision-Language Embodiment for Monocular Depth Estimation
Jinchang Zhang, Guoyu Lu
TL;DR
This work tackles the ill-posed problem of monocular depth estimation by embedding the camera model and leveraging vision-language priors. It introduces Embodied Scene Depth, computed from camera intrinsics and real-time road geometry, and fuses it with RGB features; a Depth-Guided Text Variational Auto-Encoder uses textual priors to constrain plausible scene layouts via a latent distribution sampled as $ ilde{z} = ext{mean} + ext{noise} imes ext{std}$ and decoded to depth. A cross-attention-based fusion and an image-conditioned conditional sampler integrate embodied depth with visual cues, while textual descriptions provide scale and semantic guidance, enabling better depth estimation under ambiguity. The method achieves state-of-the-art or competitive results on KITTI and DDAD, with KITTI RMSE improving to $1.654$ and DDAD to $8.673$, demonstrating dense, geometry-consistent depth without extra hardware and real-time adaptability in dynamic environments.
Abstract
Depth estimation is a core problem in robotic perception and vision tasks, but 3D reconstruction from a single image presents inherent uncertainties. Current depth estimation models primarily rely on inter-image relationships for supervised training, often overlooking the intrinsic information provided by the camera itself. We propose a method that embodies the camera model and its physical characteristics into a deep learning model, computing embodied scene depth through real-time interactions with road environments. The model can calculate embodied scene depth in real-time based on immediate environmental changes using only the intrinsic properties of the camera, without any additional equipment. By combining embodied scene depth with RGB image features, the model gains a comprehensive perspective on both geometric and visual details. Additionally, we incorporate text descriptions containing environmental content and depth information as priors for scene understanding, enriching the model's perception of objects. This integration of image and language - two inherently ambiguous modalities - leverages their complementary strengths for monocular depth estimation. The real-time nature of the embodied language and depth prior model ensures that the model can continuously adjust its perception and behavior in dynamic environments. Experimental results show that the embodied depth estimation method enhances model performance across different scenes.
