Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling
Donggeun Kim, Yujin Jo, Myungjoo Lee, Taesup Kim
TL;DR
This work tackles the challenge of preserving zero-shot capabilities in vision-language models like CLIP while incorporating domain-specific knowledge. It introduces Group-wise Prompt Ensemble (GPE), a prompt-based framework that uses prompt grouping, masked attention, auxiliary prompts, and covariance-regularized ensemble learning to separate and merge diverse knowledge sources without eroding pre-trained representations. Empirical results show that GPE achieves strong base-to-new generalization and robust cross-dataset transfer, often maintaining near zero-shot performance after fine-tuning on niche domains. The analysis demonstrates that prompt diversification, grouping, and carefully designed ensemble strategies are key to improving adaptability while maintaining generalization in real-world vision-language tasks.
Abstract
The advancement of vision-language models, particularly the Contrastive Language-Image Pre-training (CLIP) model, has revolutionized the field of machine learning by enabling robust zero-shot learning capabilities. These capabilities allow models to understand and respond to previously unseen data without task-specific training. However, adapting CLIP to integrate specialized knowledge from various domains while retaining its zero-shot capabilities remains a significant challenge. To address this, we introduce a novel prompt ensemble learning approach called Group-wise Prompt Ensemble (GPE). This method aims to enhance CLIP's zero-shot capabilities by incorporating new domain knowledge while improving its adaptability and robustness against data distribution shifts. Our approach hinges on three main strategies: prompt grouping with masked attention to optimize CLIP's adaptability while safeguarding its zero-shot capabilities; the incorporation of auxiliary prompts for the seamless integration of new domain insights without disrupting the original model's representation; and an ensemble learning strategy that effectively merges original and new knowledge. Through rigorous experimentation, including more challenging cross-dataset transfer evaluations, our GPE method redefines the benchmarks for the adaptability and efficiency of vision-language models, surpassing existing models across various scenarios.
