Large Language Models Can Understanding Depth from Monocular Images
Zhongyi Xia, Tianzhao Wu
TL;DR
This work investigates whether pretrained large language models (LLMs) can infer depth from a single image by bridging vision and language. It introduces LLM-MDE, a multimodal framework for monocular depth estimation (MDE) that couples a Vision Transformer with an LLM via cross-modal reprogramming and adaptive depth prompts, and employs a ResNet-based adaptation head to generate depth maps. The approach uses low-rank adaptation (LoRA) to keep resource usage low and a scale-invariant loss to stabilize training, and it demonstrates strong few-/zero-shot performance on real-world MDE data. Ablation studies confirm the value of the adaptive prompts and of LoRA, and hyper-parameter analyses guide effective tuning. Overall, the paper shows that LLMs can serve as interpretable depth-reasoning engines when equipped with targeted cross-modal alignment and prompt-generation components.
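To make the scale-invariant loss mentioned above concrete, the following is a minimal sketch of a widely used log-space formulation, in which the squared mean of the log-depth error is subtracted from the mean squared log-depth error. It is an illustration under stated assumptions, not the paper's implementation: the function name, the weighting `lam`, and the `eps` guard are hypothetical choices, and the exact variant used by LLM-MDE may differ.

```python
import math


def scale_invariant_log_loss(pred, target, lam=0.5, eps=1e-8):
    """Scale-invariant log loss over paired, positive depth values.

    pred, target: flat sequences of depths with the same length.
    lam: hypothetical weighting; lam=1.0 makes the loss fully
         invariant to a global rescaling of the prediction.
    eps: small constant (an assumption) that guards against log(0).
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")

    # Per-pixel error in log space.
    diffs = [math.log(p + eps) - math.log(t + eps) for p, t in zip(pred, target)]
    n = len(diffs)

    mean_sq = sum(d * d for d in diffs) / n  # mean squared log error
    mean = sum(diffs) / n                    # mean log error (global scale offset)

    # Subtracting the squared mean discounts errors explained by a global scale.
    return mean_sq - lam * mean * mean
```

With `lam=1.0`, multiplying every predicted depth by the same constant leaves the loss essentially unchanged, which is the property that makes this family of losses attractive when absolute scale is ambiguous in a single image.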
Abstract
Monocular depth estimation (MDE) is a critical function in computer vision applications. This paper shows that large language models (LLMs) can effectively interpret depth with minimal supervision, while using resources efficiently and keeping a consistent neural network architecture. We introduce LLM-MDE, a multimodal framework that deciphers depth through language comprehension. Specifically, LLM-MDE employs two main strategies to enhance the pretrained LLM's capability for depth estimation: cross-modal reprogramming, which aligns vision representations with text prototypes, and an adaptive prompt estimation module, which automatically generates prompts from monocular images. Comprehensive experiments on real-world MDE datasets confirm the effectiveness and superiority of LLM-MDE, which excels in few-/zero-shot tasks while minimizing resource use. The source code is available.
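One way to picture the alignment of vision representations with text prototypes is as cross-attention, where each image patch embedding queries a small set of prototype vectors and is re-expressed as a weighted mix of them. The sketch below assumes exactly that reading and omits the learned projections and multiple heads a real model would use; the function names and the single-head, projection-free design are hypothetical simplifications, not the paper's architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reprogram(patches, prototypes):
    """Re-express each patch embedding as an attention-weighted mix of text prototypes.

    patches:    list of patch embeddings (queries), each a list of floats.
    prototypes: list of text prototype vectors (keys and values), same dimension.
    Assumption: no learned query/key/value projections, single attention head.
    """
    dim = len(prototypes[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product attention factor

    reprogrammed = []
    for query in patches:
        # Similarity of this patch to every text prototype.
        scores = [scale * sum(q * k for q, k in zip(query, proto)) for proto in prototypes]
        weights = softmax(scores)
        # Weighted combination of prototypes, one output vector per patch.
        mixed = [sum(w * proto[j] for w, proto in zip(weights, prototypes)) for j in range(dim)]
        reprogrammed.append(mixed)
    return reprogrammed
```

The output has one vector per patch, but each vector now lies in the span of the text prototypes, which is what lets a frozen or lightly adapted LLM consume visual input in its own embedding space.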
