Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling

Donggeun Kim, Yujin Jo, Myungjoo Lee, Taesup Kim

TL;DR

This work tackles the challenge of preserving zero-shot capabilities in vision-language models like CLIP while incorporating domain-specific knowledge. It introduces Group-wise Prompt Ensemble (GPE), a prompt-based framework that uses prompt grouping, masked attention, auxiliary prompts, and covariance-regularized ensemble learning to separate and merge diverse knowledge sources without eroding pre-trained representations. Empirical results show that GPE achieves strong base-to-new generalization and robust cross-dataset transfer, often maintaining near zero-shot performance after fine-tuning on niche domains. The analysis demonstrates that prompt diversification, grouping, and carefully designed ensemble strategies are key to improving adaptability while maintaining generalization in real-world vision-language tasks.

Abstract

The advancement of vision-language models, particularly the Contrastive Language-Image Pre-training (CLIP) model, has revolutionized the field of machine learning by enabling robust zero-shot learning capabilities. These capabilities allow models to understand and respond to previously unseen data without task-specific training. However, adapting CLIP to integrate specialized knowledge from various domains while retaining its zero-shot capabilities remains a significant challenge. To address this, we introduce a novel prompt ensemble learning approach called Group-wise Prompt Ensemble (GPE). This method aims to enhance CLIP's zero-shot capabilities by incorporating new domain knowledge while improving its adaptability and robustness against data distribution shifts. Our approach hinges on three main strategies: prompt grouping with masked attention to optimize CLIP's adaptability while safeguarding its zero-shot capabilities; the incorporation of auxiliary prompts for the seamless integration of new domain insights without disrupting the original model's representation; and an ensemble learning strategy that effectively merges original and new knowledge. Through rigorous experimentation, including more challenging cross-dataset transfer evaluations, our GPE method redefines the benchmarks for the adaptability and efficiency of vision-language models, surpassing existing models across various scenarios.

Paper Structure

This paper contains 19 sections, 10 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Cross-dataset evaluation across various source datasets. This evaluation measures how well models trained on a specific source dataset (e.g., ImageNet, Flowers102, FGVCAircraft) generalize when tested on 10 other target datasets, relative to CLIP's zero-shot performance. When the source is a general dataset, as in (a), most models maintain or even exceed CLIP's zero-shot performance. However, when the models are fine-tuned on specialized datasets, as in (b) and (c), and then evaluated on other datasets, the baseline models show significant performance drops. In contrast, our model, GPE, retains strong performance even when trained on these niche datasets, highlighting its ability to adapt without losing generalization.
  • Figure 2: Overview of our framework. The framework consists of a Text Encoder with Grouped Prompts ($P_{t}$) and an Image Encoder with Grouped Prompts ($P_{v}$). The first group of the main prompts ($P_{t}^{1}$, $P_{v}^{1}$) is shown in blue, the second group ($P_{t}^{2}$, $P_{v}^{2}$) in red, and the auxiliary prompts ($P_{t}^{\prime}$, $P_{v}^{\prime}$) in gray. During training, we use a Group-wise Ensemble approach; for inference, we use a Full Ensemble strategy. Full Ensemble inference integrates the outputs of all prompt groups and the special token (white) from CLIP to enhance predictive performance.
  • Figure 3: Attention masks of GPE. In transformer models, attention masks determine which parts of the input can interact by allowing or blocking connections between them. The colored boxes indicate positions where attention occurs, while the white boxes indicate masked positions. The first-group prompts are restricted to reading the input only, without modifying it. The second-group prompts may attend to both the input and the auxiliary prompts. Each prompt token can also attend to itself. The auxiliary prompts attend only to themselves, with all other features remaining masked.
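
The masking pattern described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the token ordering (input, first group, second group, auxiliary), and the group sizes are all hypothetical. Two further readings of the caption are assumptions here: input tokens attend only to other input tokens (so the prompts cannot alter them), and auxiliary prompts attend to each other as a block. The sketch also ignores the causal mask that CLIP's text encoder applies.

```python
def build_gpe_style_mask(n_input, n_group1, n_group2, n_aux):
    """Return a boolean matrix where allowed[q][k] is True if query token q
    may attend to key token k. Token order (an assumption for this sketch):
    [input | first-group prompts | second-group prompts | auxiliary prompts].
    """
    g1_start = n_input
    g2_start = g1_start + n_group1
    aux_start = g2_start + n_group2
    total = aux_start + n_aux

    inp = range(0, n_input)
    g1 = range(g1_start, g2_start)
    g2 = range(g2_start, aux_start)
    aux = range(aux_start, total)

    allowed = [[False] * total for _ in range(total)]

    def allow(queries, keys):
        for q in queries:
            for k in keys:
                allowed[q][k] = True

    allow(inp, inp)   # assumption: input tokens see only input tokens
    allow(g1, inp)    # first group reads the input
    allow(g2, inp)    # second group reads the input ...
    allow(g2, aux)    # ... and the auxiliary prompts
    allow(aux, aux)   # assumption: auxiliary prompts see only each other
    for i in range(total):
        allowed[i][i] = True  # every token may attend to itself
    return allowed


def to_additive_mask(allowed):
    """Convert the boolean mask to the additive form (0 or -inf) that is
    added to attention scores before the softmax."""
    return [[0.0 if ok else float("-inf") for ok in row] for row in allowed]


if __name__ == "__main__":
    # Hypothetical sizes: 3 input tokens and 2 prompts per group.
    mask = build_gpe_style_mask(3, 2, 2, 2)
    for row in mask:
        # '#' marks an allowed connection, '.' a masked one.
        print("".join("#" if ok else "." for ok in row))
```

Because the input rows never attend to prompt columns, the input features pass through the layer exactly as they would without any prompts, which is one way to read the caption's "without modifying it". Every row keeps at least its diagonal entry, so the softmax over each row of the additive mask is well defined.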