MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation

Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani

TL;DR

MMLoP is a framework that achieves deep multi-modal prompting with only 11.5K trainable parameters, comparable to early text-only methods like CoOp. It parameterizes vision and text prompts at each transformer layer through a low-rank factorization, which serves as an implicit regularizer against overfitting on few-shot training data.
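
The low-rank parameterization summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the depth, prompt length, rank, and embedding widths (`NUM_LAYERS`, `NUM_TOKENS`, `RANK`, `DIM_TEXT`, `DIM_VISION`) and the helper names are assumptions chosen for illustration, and each modality gets its own pair of factors here.

```python
import numpy as np

# All sizes below are hypothetical, chosen only for illustration.
NUM_LAYERS = 9     # transformer layers that receive prompts
NUM_TOKENS = 4     # prompt tokens per layer
RANK = 1           # rank of the factorization
DIM_TEXT = 512     # text-encoder embedding width
DIM_VISION = 768   # vision-encoder embedding width

rng = np.random.default_rng(0)


def init_factors(num_tokens, rank, dim):
    """Create the two small factors of one low-rank prompt."""
    up = rng.normal(scale=0.02, size=(num_tokens, rank))
    down = rng.normal(scale=0.02, size=(rank, dim))
    return up, down


def build_prompt(up, down):
    """Materialize the (num_tokens, dim) prompt from its factors."""
    return up @ down


# One pair of factors per modality and per layer.
layers = [
    {
        "text": init_factors(NUM_TOKENS, RANK, DIM_TEXT),
        "vision": init_factors(NUM_TOKENS, RANK, DIM_VISION),
    }
    for _ in range(NUM_LAYERS)
]

# Trainable parameters: factors only, versus full-rank prompts.
low_rank_params = sum(
    up.size + down.size for layer in layers for up, down in layer.values()
)
full_params = NUM_LAYERS * NUM_TOKENS * (DIM_TEXT + DIM_VISION)

text_prompt = build_prompt(*layers[0]["text"])
print(text_prompt.shape)   # (4, 512)
print(low_rank_params)     # 11592
print(full_params)         # 46080
```

These counts follow only from the illustrative sizes and are not a reproduction of the 11.5K figure. The point of the sketch is that the trainable cost grows with `RANK * (NUM_TOKENS + dim)` per prompt rather than `NUM_TOKENS * dim`.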

Abstract

Prompt learning has become a dominant paradigm for adapting vision-language models (VLMs) such as CLIP to downstream tasks without modifying pretrained weights. While extending prompts to both vision and text encoders across multiple transformer layers significantly boosts performance, it dramatically increases the number of trainable parameters, with state-of-the-art methods requiring millions of parameters and abandoning the parameter efficiency that makes prompt tuning attractive. In this work, we propose MMLoP (Multi-Modal Low-Rank Prompting), a framework that achieves deep multi-modal prompting with only 11.5K trainable parameters, comparable to early text-only methods like CoOp. MMLoP parameterizes vision and text prompts at each transformer layer through a low-rank factorization, which serves as an implicit regularizer against overfitting on few-shot training data. To further close the accuracy gap with state-of-the-art methods, we introduce three complementary components: a self-regulating consistency loss that anchors prompted representations to frozen zero-shot CLIP features at both the feature and logit levels, a uniform drift correction that removes the global embedding shift induced by prompt tuning to preserve class-discriminative structure, and a shared up-projection that couples vision and text prompts through a common low-rank factor to enforce cross-modal alignment. Extensive experiments across three benchmarks and 11 diverse datasets demonstrate that MMLoP achieves a highly favorable accuracy-efficiency tradeoff, outperforming the majority of existing methods including those with orders of magnitude more parameters, while achieving a harmonic mean of 79.70% on base-to-novel generalization.
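
The abstract does not give a formula for the uniform drift correction, so the following is only one possible reading, offered as a sketch: treat the "global embedding shift" as the mean displacement of the prompted class embeddings from their frozen zero-shot counterparts, and subtract it. The function name `remove_uniform_drift`, the synthetic data, and the sizes are all hypothetical.

```python
import numpy as np


def remove_uniform_drift(prompted, zero_shot):
    """Subtract the displacement shared by all class embeddings.

    Assumed reading of the abstract, not the paper's stated formula:
    the drift is the mean of (prompted - zero_shot) over classes.
    """
    drift = (prompted - zero_shot).mean(axis=0, keepdims=True)
    return prompted - drift


rng = np.random.default_rng(0)
num_classes, dim = 5, 8   # hypothetical sizes

# Synthetic stand-ins for frozen zero-shot and prompted class embeddings.
zero_shot = rng.normal(size=(num_classes, dim))
global_shift = rng.normal(size=(1, dim))               # same for every class
class_specific = 0.1 * rng.normal(size=(num_classes, dim))
prompted = zero_shot + global_shift + class_specific

corrected = remove_uniform_drift(prompted, zero_shot)

# After correction the average displacement from zero-shot is zero,
# while differences between class embeddings are left unchanged.
residual = (corrected - zero_shot).mean(axis=0)
print(np.allclose(residual, 0.0))
```

Under this reading, subtracting a single vector from every class embedding leaves all pairwise differences between classes intact, which is consistent with the stated goal of preserving class-discriminative structure.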

Paper Structure

This paper contains 16 sections, 9 equations, 5 figures, 11 tables, 1 algorithm.

Figures (5)

  • Figure 1: Accuracy vs. number of trainable parameters for prompt learning methods on base-to-novel generalization (left) and all-to-all few-shot classification (right).
  • Figure 2: Overview of MMLoP. Both text and image encoders are equipped with deep low-rank prompts (bow-tie symbols) at each transformer layer, with vision and text prompts sharing a common up-projection matrix $\bm{U}^{(l)}$ for cross-modal alignment. Snowflake icons indicate frozen CLIP parameters. The self-regulating consistency loss ($\mathcal{L}_{\text{SCL}}$) is omitted for clarity.
  • Figure 3: Visualization of the learned shared up-projection matrix $\bm{U}^{(l)}$ across transformer layers and prompt tokens for each of the 11 datasets and their average.
  • Figure 4: All-to-all few-shot classification results on 11 datasets using the ViT-B/16 backbone across $K \in \{1, 2, 4, 8, 16\}$ shots.
  • Figure 5: All-to-all few-shot classification results on 11 datasets using the ViT-B/32 backbone across $K \in \{1, 2, 4, 8, 16\}$ shots.
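
The shared up-projection mentioned in the Figure 2 caption can be sketched for a single layer as follows. This is an illustrative assumption about shapes, not the authors' code: `shared_up` plays the role of $\bm{U}^{(l)}$, while `down_text`, `down_vision`, and all sizes are hypothetical names and values.

```python
import numpy as np

# Hypothetical sizes for one transformer layer.
NUM_TOKENS, RANK = 4, 2
DIM_TEXT, DIM_VISION = 512, 768

rng = np.random.default_rng(0)

# One factor shared by both modalities (the role of U^(l)).
shared_up = rng.normal(scale=0.02, size=(NUM_TOKENS, RANK))

# Modality-specific factors map the shared factor to each encoder's width.
down_text = rng.normal(size=(RANK, DIM_TEXT))
down_vision = rng.normal(size=(RANK, DIM_VISION))

text_prompt = shared_up @ down_text       # shape (NUM_TOKENS, DIM_TEXT)
vision_prompt = shared_up @ down_vision   # shape (NUM_TOKENS, DIM_VISION)

# Both prompts are built from the same token-side factor, so stacking
# them side by side cannot exceed the rank of that shared factor.
joint = np.hstack([text_prompt, vision_prompt])
joint_rank = np.linalg.matrix_rank(joint)

trainable = shared_up.size + down_text.size + down_vision.size
print(text_prompt.shape, vision_prompt.shape)
print(joint_rank)
print(trainable)   # 4*2 + 2*512 + 2*768 = 2568
```

Because the token-side factor is shared, any update to `shared_up` changes the text and vision prompts together, which is one way to read the coupling that the caption describes.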