Large Language Models Can Understanding Depth from Monocular Images

Zhongyi Xia; Tianzhao Wu

TL;DR

This work investigates using pretrained large language models to infer monocular depth by bridging vision and language. It introduces LLM-MDE, a multimodal framework that couples a Vision Transformer with an LLM via cross-modal reprogramming and adaptive depth prompts, and employs a ResNet-based adaptation head to generate depth maps. The approach uses LoRA to keep resource usage low and a scale-invariant loss to stabilize training, demonstrating strong few-/zero-shot performance on real-world MDE data. Ablation studies confirm the value of adaptive prompts and LoRA, and hyper-parameter analyses guide effective tuning. Overall, the paper shows that LLMs can serve as interpretable depth-reasoning engines when equipped with targeted cross-modal alignment and prompt-generation components.
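To make the scale-invariant loss mentioned above concrete, here is a minimal sketch of the log-space form that is widely used in monocular depth estimation. This summary does not give the paper's exact formulation, so the function name, the weighting parameter `lam`, and its default value are assumptions for illustration only.

```python
import math

def scale_invariant_log_loss(pred, target, lam=0.5):
    """Scale-invariant error in log space over flat lists of positive depths.

    Assumed form (not confirmed by the source):
        mean(d_i^2) - lam * mean(d_i)^2,  where d_i = log(pred_i) - log(target_i)
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    if any(p <= 0 for p in pred) or any(t <= 0 for t in target):
        raise ValueError("depths must be strictly positive")
    # Per-pixel log-depth differences.
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(diffs)
    mean_of_squares = sum(d * d for d in diffs) / n
    square_of_mean = (sum(diffs) / n) ** 2
    return mean_of_squares - lam * square_of_mean

# Hypothetical depths in metres, used only to exercise the function.
target = [1.0, 2.0, 4.0, 8.0]
pred = [1.1, 1.9, 4.4, 7.5]
loss_value = scale_invariant_log_loss(pred, target)
```

With `lam=1.0` the loss ignores any global rescaling of the prediction, because a constant factor shifts every log difference by the same amount. Smaller values of `lam` keep part of the penalty on absolute scale.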

Abstract

Monocular depth estimation (MDE) is a critical function in computer vision applications. This paper shows that large language models (LLMs) can effectively interpret depth with minimal supervision, efficient resource use, and a consistent neural network architecture. We introduce LLM-MDE, a multimodal framework that deciphers depth through language comprehension. Specifically, LLM-MDE employs two main strategies to enhance the pretrained LLM's capability for depth estimation: cross-modal reprogramming, which aligns vision representations with text prototypes, and an adaptive prompt estimation module, which automatically generates prompts from monocular images. Comprehensive experiments on real-world MDE datasets confirm the effectiveness of LLM-MDE, which excels in few-/zero-shot tasks while minimizing resource use. The source code is available.
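One way to picture aligning vision representations with text prototypes is a single cross-attention step in which each image-patch embedding is rewritten as a weighted mix of prototype vectors. The sketch below is an assumption about the general idea, not the authors' implementation: it omits the learned query/key/value projections and multiple heads a real model would use, and the names `reprogram`, `patches`, and `prototypes` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reprogram(patches, prototypes):
    """Map each patch embedding to a convex combination of text prototypes.

    patches:    list of vectors (queries), one per image patch
    prototypes: list of vectors (keys and values), one per text prototype
    All vectors are assumed to share the same dimension.
    """
    dim = len(prototypes[0])
    out = []
    for q in patches:
        # Scaled dot-product attention weights of this patch over the prototypes.
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in prototypes])
        # Weighted sum of prototype vectors.
        out.append([sum(w * k[j] for w, k in zip(weights, prototypes))
                    for j in range(dim)])
    return out

# Toy 2-D example with two prototypes and two patches.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
patches = [[10.0, 0.0], [0.0, 10.0]]
aligned = reprogram(patches, prototypes)
```

Because the attention weights sum to one, every output vector lies inside the span of the prototypes, which is what lets a frozen language model consume visual features expressed in its own embedding space.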

Paper Structure (11 sections, 3 equations, 5 figures, 5 tables)

Introduction
Methodology
Cross-modal Reprogramming between Vision and Text
Adaptive Depth Prompts Generation Module
Depth Projection from Adaption Head
Lightweight Operations and Optimization
Experiments
Few-Shot and Zero-Shot Experiments
Ablation Experiments
Hyper-parameter Sensitivity
Conclusions

Figures (5)

Figure 1: Visual results of the few-shot experiments with limited resources.
Figure 2: Visual results of the cross-domain zero-shot experiments.
Figure 3: Visual results of the prompts ablation study.
Figure 4: Visual results of the LoRA fine-tuning experiments.
Figure 5: Visual results of the hyper-parameter sensitivity fine-tuning experiments. Details of the 8 schemes are given in the Hyper-parameter Sensitivity table.
