Mixture of Prompt Learning for Vision Language Models

Yu Du, Tong Niu, Rong Zhao

TL;DR

This work tackles adapting vision-language models like CLIP through prompt learning by addressing two core issues: style variation within datasets and overfitting during prompt tuning. It introduces MoCoOp, a mixture of soft prompts with a routing module that selects top-$K$ prompts per instance, a hard prompt guided gating loss to align routing with hard templates, and semantically grouped text-level supervision to preserve prior knowledge. Empirical results across 11 datasets show consistent gains in few-shot learning, base-to-new generalization, and domain generalization over strong baselines, while maintaining inference costs near a single prompt. The approach enhances robustness and adaptability of VLMs for downstream tasks and opens doors to richer, group-aware prompting strategies, with future work including extensions to vision prompts and automated hard-template grouping.

Abstract

As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to apply VLMs to downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks while requiring only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture of soft prompt learning method incorporating a routing module. This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism to ensure the router selects prompts based on their similarity to hard prompt templates, which both retains knowledge from hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision: each soft prompt is initialized with the token embeddings of manually designed templates from its group, and a contrastive loss is applied between the resulting text feature and the text feature encoded from the hard prompts. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating clear improvements over existing baselines in few-shot learning, domain generalization, and base-to-new generalization scenarios. The code will be available at https://anonymous.4open.science/r/mocoop-6387
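The text-level supervision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `text_supervision_loss`, the cross-entropy form of the contrastive loss, the temperature of 0.07, and the use of plain Python lists in place of CLIP text-encoder outputs are all assumptions made for illustration. The idea shown is only that the soft-prompt text feature of class c should be most similar to the hard-prompt text feature of the same class.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def text_supervision_loss(soft_feats, hard_feats, temperature=0.07):
    """Hypothetical contrastive loss between soft- and hard-prompt text features.

    soft_feats[c]: text feature of class c produced from a soft prompt.
    hard_feats[c]: text feature of class c produced from the hard templates
                   of the same group (assumed to be precomputed).
    Returns the mean cross-entropy in which class c's soft feature must
    pick out class c's hard feature among all classes.
    """
    soft = [l2_normalize(v) for v in soft_feats]
    hard = [l2_normalize(v) for v in hard_feats]
    total = 0.0
    for c, s in enumerate(soft):
        logits = [dot(s, h) / temperature for h in hard]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[c]
    return total / len(soft)
```

Under these assumptions, the loss is small when each soft-prompt feature stays aligned with its own class's hard-prompt feature and grows as the soft prompt drifts toward other classes, which is the overfitting behavior the supervision is meant to discourage.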

Paper Structure (20 sections, 12 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: For a dataset, the existing hard templates can be divided into different sets based on the different styles and patterns they describe in the images (such as different contents within the different colored blocks). Furthermore, one image can simultaneously possess multiple different styles. Traditionally, only one soft prompt is used to fit all images, but we use multiple soft prompts. Each soft prompt represents a style, and a router selects the best matches. This approach better bridges the gap between visual and text features by taking different styles into consideration.
  • Figure 2: Overview of MoCoOp. The orange lines signify the extra flow used only during training, while the black lines are shared by training and inference. During inference, the two soft prompts with the highest router probabilities are selected and combined with the available classes for text encoding. The resulting text features are averaged and used for classification. During training, hard-prompt-guided routing and semantically grouped text-level supervision are introduced to supervise the router and the soft prompts, respectively. In our experiments, we set k, the number of selected prompts, to 2.
  • Figure 3: Few-shot learning results on 11 datasets, plotted across 1, 2, 4, 8, and 16 shots. MoCoOp consistently and significantly surpasses CoOp, LASP, and the Linear Probe approach on most datasets, as reflected in the average accuracy displayed in the top-left corner. For LASP [bulat2022lasp], we use our reproduced results.
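
The inference flow described in the Figure 2 caption (select the k most probable soft prompts, average their text features, classify) can be sketched as follows. This is a toy illustration under stated assumptions, not the released code: `route_top_k` and `classify` are hypothetical names, the router logits and per-prompt text features are assumed to be precomputed, and cosine similarity with L2 normalization is assumed for the final scoring.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route_top_k(router_logits, k=2):
    """Return the indices of the k most probable soft prompts and all probabilities."""
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k], probs


def classify(image_feature, text_features, router_logits, k=2):
    """Classify one image with the top-k routed soft prompts.

    text_features[g][c] is the (assumed precomputed) text feature obtained
    by encoding soft prompt g together with class name c.
    """
    selected, _ = route_top_k(router_logits, k)
    img = l2_normalize(image_feature)
    num_classes = len(text_features[0])
    scores = []
    for c in range(num_classes):
        dim = len(text_features[0][c])
        # Average the text features of the selected prompts for class c.
        mean = [
            sum(text_features[g][c][d] for g in selected) / len(selected)
            for d in range(dim)
        ]
        scores.append(dot(img, l2_normalize(mean)))
    pred = max(range(num_classes), key=lambda c: scores[c])
    return pred, selected, scores
```

Because only the k selected prompts are passed through the text encoder for a given image, a design of this kind keeps the per-image cost close to that of a small, fixed number of prompts regardless of how many soft prompts are trained.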