Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

Reyhaneh Ahani Manghotay, Jie Liang

Abstract

Leveraging the rich semantic features of vision-language models (VLMs) such as CLIP for monocular depth estimation is a promising direction, yet it often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone, combined with selective fine-tuning of the final layers. This design enables spatially aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that combines depth-bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $\delta_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved with substantially fewer trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation.
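To make the MoA design concrete, the following is a minimal PyTorch sketch of a Mixture-of-Adapters block attached to a frozen ViT layer, assuming standard bottleneck adapters and a softmax router conditioned on both the patch tokens and the global scene-context vector. All module names, the number of experts, and the bottleneck width are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # One lightweight adapter expert: down-project, nonlinearity, up-project.
    # (Assumed structure; the paper's exact adapter design may differ.)
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.up(self.act(self.down(x)))

class MixtureOfAdapters(nn.Module):
    # A router weights several adapter experts per token; the output is added
    # residually to the frozen backbone features, so only the adapters and the
    # router are trained.
    def __init__(self, dim: int, num_experts: int = 4, bottleneck: int = 64):
        super().__init__()
        self.experts = nn.ModuleList(
            [BottleneckAdapter(dim, bottleneck) for _ in range(num_experts)]
        )
        # The router sees each token concatenated with the scene-context vector,
        # which is how we interpret "prompt-guided" routing here (an assumption).
        self.router = nn.Linear(2 * dim, num_experts)

    def forward(self, tokens, context):
        # tokens:  (B, N, D) patch tokens from a frozen ViT-B/32 block
        # context: (B, D)    global scene-context vector from CLIP text prompts
        ctx = context.unsqueeze(1).expand(-1, tokens.size(1), -1)
        gate = torch.softmax(
            self.router(torch.cat([tokens, ctx], dim=-1)), dim=-1
        )                                                        # (B, N, E)
        expert_out = torch.stack(
            [e(tokens) for e in self.experts], dim=-1
        )                                                        # (B, N, D, E)
        mixed = (expert_out * gate.unsqueeze(2)).sum(dim=-1)     # (B, N, D)
        return tokens + mixed  # residual adaptation keeps the backbone frozen

Because only the adapters and the router carry gradients, the trainable parameter count stays small relative to the frozen ViT-B/32 backbone, which is the point of the parameter-efficient design.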

Figures (2)

  • Figure 1: Overall architecture of MoA-DepthCLIP. Scene prompts are encoded using a frozen CLIP text encoder to form a global scene context vector. The image is encoded with a frozen ViT-B/32 backbone augmented with MoAs. Fused features are then passed to a dual-head prediction module: one head performs depth bin classification and produces a binned depth map via weighted summation, while the other head performs direct regression. The final output depth map is a fusion of both predictions.
  • Figure 2: Overview of our lightweight adaptation strategy. (a) Selective placement of MoA modules within the ViT backbone. (b) Internal architecture of a single MoA module.
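The dual-head prediction module described in the Figure 1 caption can also be sketched in code. Below, a classification head over discrete depth bins is decoded by a weighted sum over bin centers, a regression head predicts depth directly, and the two are fused. The linearly spaced bins over the NYU depth range and the sigmoid-weighted fusion are illustrative assumptions; the paper may use different binning and fusion schemes.

import torch
import torch.nn as nn

class DualDepthHead(nn.Module):
    # Hybrid depth prediction: bin classification decoded by a weighted sum
    # of bin centers, plus direct regression, fused into one depth estimate.
    def __init__(self, dim: int, num_bins: int = 64,
                 min_depth: float = 0.1, max_depth: float = 10.0):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_bins)
        self.reg_head = nn.Linear(dim, 1)
        # Fixed, linearly spaced bin centers over an assumed NYU depth range;
        # learned or log-spaced bins are equally plausible.
        self.register_buffer(
            "bin_centers", torch.linspace(min_depth, max_depth, num_bins)
        )
        # Learnable scalar balancing the two heads (an assumption).
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, feats):
        # feats: (B, N, D) fused per-patch features from the adapted backbone
        probs = torch.softmax(self.cls_head(feats), dim=-1)   # (B, N, K)
        binned = (probs * self.bin_centers).sum(dim=-1)       # weighted sum
        direct = self.reg_head(feats).squeeze(-1)             # (B, N)
        w = torch.sigmoid(self.alpha)
        return w * binned + (1.0 - w) * direct                # fused depth

Decoding bins by a soft weighted sum keeps the classification branch differentiable end to end, while the regression branch supplies continuous refinement; the fusion weight lets training decide how much to trust each.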