Large Language Models Can Understanding Depth from Monocular Images

Zhongyi Xia, Tianzhao Wu

TL;DR

This work investigates using pretrained large language models to infer monocular depth by bridging vision and language. It introduces LLM-MDE, a multimodal framework that couples a Vision Transformer with an LLM via cross-modal reprogramming and adaptive depth prompts, and employs a ResNet-based adaptation head to generate depth maps. The approach uses LoRA to keep resource usage low and a scale-invariant loss to stabilize training, demonstrating strong few-/zero-shot performance on real-world MDE data. Ablation studies confirm the value of adaptive prompts and LoRA, and hyper-parameter analyses guide effective tuning. Overall, the paper shows that LLMs can serve as interpretable depth-reasoning engines when equipped with targeted cross-modal alignment and prompt-generation components.
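
The summary names a scale-invariant loss but does not state its form. As a minimal sketch, assuming the widely used log-space formulation from Eigen et al. (2014), the idea can be illustrated as follows. The function name, the `lam` weight, and the `eps` floor are illustrative choices, not taken from the paper.

```python
import numpy as np

def scale_invariant_log_loss(pred, target, lam=1.0, eps=1e-6):
    """Scale-invariant log loss (Eigen et al. style); an assumed form, not the paper's.

    pred, target: arrays of positive depths with the same shape.
    lam: weight of the scale-correcting term (1.0 gives full scale invariance).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    valid = target > 0  # skip pixels that have no ground-truth depth
    # Per-pixel log-depth difference; the floor keeps the log finite.
    d = np.log(np.maximum(pred[valid], eps)) - np.log(target[valid])
    n = d.size
    if n == 0:
        return 0.0
    # mean(d^2) - lam * mean(d)^2: a global scale shift in pred moves every d
    # by the same constant, which the second term cancels when lam == 1.
    return float(np.mean(d ** 2) - lam * (d.sum() ** 2) / (n ** 2))

# Usage: multiplying the prediction by a constant leaves the loss unchanged.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(4, 5))
pred = gt * rng.uniform(0.8, 1.2, size=gt.shape)
loss_a = scale_invariant_log_loss(pred, gt)
loss_b = scale_invariant_log_loss(3.0 * pred, gt)
```

With `lam=1.0` the loss equals the variance of the log-depth error, so it penalizes wrong relative structure but not a wrong global scale, which is one reason such losses tend to stabilize training across datasets with different depth ranges.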

Abstract

Monocular depth estimation is a critical function in computer vision applications. This paper shows that large language models (LLMs) can effectively interpret depth with minimal supervision, efficient resource use, and a consistent neural network architecture. We introduce LLM-MDE, a multimodal framework that deciphers depth through language comprehension. Specifically, LLM-MDE employs two main strategies to enhance the pretrained LLM's capability for depth estimation: cross-modal reprogramming, which aligns vision representations with text prototypes, and an adaptive prompt estimation module, which automatically generates prompts from monocular images. Comprehensive experiments on real-world MDE datasets confirm the effectiveness and superiority of LLM-MDE, which excels in few-/zero-shot tasks while minimizing resource use. The source code is available.
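
The abstract says that cross-modal reprogramming aligns vision representations with text prototypes, without giving the mechanism. One plausible reading, shown here purely as an assumption, is a cross-attention step in which image patch tokens act as queries over a small set of text prototype embeddings. All names, shapes, and the single-head design below are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reprogram(vision_tokens, text_prototypes, w_q, w_k, w_v):
    """Single-head cross-attention from vision tokens to text prototypes.

    vision_tokens:   (num_patches, d_vit)     patch features from the vision encoder
    text_prototypes: (num_prototypes, d_llm)  embeddings in the LLM's token space
    w_q: (d_vit, d_attn), w_k: (d_llm, d_attn), w_v: (d_llm, d_llm)
    Returns the re-expressed tokens (num_patches, d_llm) and the attention weights.
    """
    q = vision_tokens @ w_q                    # queries come from the image
    k = text_prototypes @ w_k                  # keys come from the text prototypes
    v = text_prototypes @ w_v                  # values stay in the LLM space
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product similarity
    weights = softmax(scores, axis=-1)         # each patch mixes the prototypes
    return weights @ v, weights

# Usage with random stand-ins for real encoder outputs and learned weights.
rng = np.random.default_rng(0)
num_patches, num_prototypes, d_vit, d_llm, d_attn = 9, 5, 32, 64, 16
tokens = rng.normal(size=(num_patches, d_vit))
prototypes = rng.normal(size=(num_prototypes, d_llm))
w_q = rng.normal(size=(d_vit, d_attn)) / np.sqrt(d_vit)
w_k = rng.normal(size=(d_llm, d_attn)) / np.sqrt(d_llm)
w_v = rng.normal(size=(d_llm, d_llm)) / np.sqrt(d_llm)
aligned, weights = reprogram(tokens, prototypes, w_q, w_k, w_v)
```

Under this reading, each output row is a convex combination of projected text prototypes, so the image is handed to the frozen LLM in a representation that already lies in its own embedding space.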

Paper Structure

This paper contains 11 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Visual results of the few-shot experiments with limited resources.
  • Figure 2: Visual results of the cross-domain zero-shot experiments.
  • Figure 3: Visual results of the prompts ablation study.
  • Figure 4: Visual results of the LoRA fine-tuning experiments.
  • Figure 5: Visual results of the hyperparameter sensitivity fine-tuning experiments. Details of the 8 schemes are given in the hyper-parameter sensitivity table.