Leveraging Stable Diffusion for Monocular Depth Estimation via Image Semantic Encoding
Jingming Xia, Guanqun Cao, Guang Ma, Yiben Luo, Qinzhao Li, John Oyekan
TL;DR
This work tackles monocular depth estimation in challenging outdoor environments by replacing text-based semantic guidance with an image-derived semantic encoding. The authors integrate a frozen latent feature extractor with a spatially enhanced SeeCoder semantic encoder and a cross-attention-enabled denoising UNet inside a Stable Diffusion framework, followed by a task-specific decoder to produce high-resolution depth maps. Evaluations on KITTI and Waymo show competitive accuracy and robustness, with SeeCoder-based semantics offering advantages over CLIP in outdoor scenes, though textureless regions remain challenging. The approach demonstrates strong generalization potential and suggests applicability to other visual perception tasks beyond depth estimation.
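To illustrate how image-derived semantics can condition a denoising UNet, the following is a minimal single-head cross-attention sketch in NumPy. It is not the authors' implementation: the function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all tensor sizes are hypothetical choices made for illustration. The only assumption carried over from the summary above is that queries come from the latent features and keys and values come from the semantic encoder's output tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(latents, semantics, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative sketch).

    latents:   (n_lat, d_lat) flattened latent features, used as queries
    semantics: (n_sem, d_sem) image-derived semantic tokens, used as keys/values
    """
    q = latents @ w_q    # (n_lat, d_attn)
    k = semantics @ w_k  # (n_sem, d_attn)
    v = semantics @ w_v  # (n_sem, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each latent position attends over tokens
    return weights @ v, weights


# Hypothetical sizes: a 4x4 latent grid flattened to 16 positions, 6 semantic tokens.
rng = np.random.default_rng(0)
n_lat, d_lat, n_sem, d_sem, d_attn = 16, 8, 6, 12, 8
latents = rng.standard_normal((n_lat, d_lat))
semantics = rng.standard_normal((n_sem, d_sem))
w_q = rng.standard_normal((d_lat, d_attn)) * 0.1
w_k = rng.standard_normal((d_sem, d_attn)) * 0.1
w_v = rng.standard_normal((d_sem, d_attn)) * 0.1

attended, weights = cross_attention(latents, semantics, w_q, w_k, w_v)
print(attended.shape, weights.shape)
```

In a text-conditioned Stable Diffusion model, the `semantics` input would be text-encoder embeddings; under the approach summarized above, it would instead be tokens produced by an image semantic encoder, with the rest of the attention computation unchanged.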
Abstract
Monocular depth estimation predicts depth from a single RGB image and plays a crucial role in applications such as autonomous driving, robotic navigation, and 3D reconstruction. Recent advances in learning-based methods have significantly improved depth estimation performance. Generative models, particularly Stable Diffusion, have shown remarkable potential in recovering fine details and reconstructing missing regions through large-scale training on diverse datasets. However, models like CLIP, which rely on textual embeddings, are less effective in complex outdoor environments, where rich contextual information is needed. Here, we propose a novel image-based semantic embedding that extracts contextual information directly from visual features, significantly improving depth prediction in complex environments. Evaluated on the KITTI and Waymo datasets, our method achieves performance comparable to state-of-the-art models while addressing the shortcomings of CLIP embeddings in outdoor scenes. By leveraging visual semantics directly, our method demonstrates enhanced robustness and adaptability in depth estimation, showing its potential for application to other visual perception tasks.
