Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation

Zhi Chen, Xin Yu, Xiaohui Tao, Yan Li, Zi Huang

TL;DR

This work tackles suboptimal CLIP-style prompt ensembling in few-shot vision-language adaptation by shifting from text-feature averaging to logits-space aggregation. CAPEL introduces a cluster-preserving regularizer based on conditional entropy and an adaptive prompt weighting mechanism, enabling multiple class sub-prototypes to specialize without collapsing. The approach demonstrates consistent gains across 11 datasets, strong domain generalization, and applicability to segmentation, medical, and industrial domains, all with modest training overhead. The results underscore the practical impact of preserving multi-cluster structure in the visual space and leveraging diverse prompts for robust, scalable VLM adaptation.

Abstract

Vision-language models (VLMs) such as CLIP achieve zero-shot transfer across various tasks by pre-training on numerous image-text pairs. These models often benefit from using an ensemble of context prompts to represent a class. Despite being effective, conventional prompt ensembling that averages textual features of context prompts often yields suboptimal results. This is because feature averaging shifts the class centroids away from the true class distribution. To address this issue, we propose the Cluster-Aware Prompt Ensemble Learning (CAPEL) framework, which preserves the cluster nature of context prompts. CAPEL classifies images into one of several class clusters, each represented by a distinct prompt. Instead of ensembling prompts in the feature space, we perform ensembling in the classification logits space, aligning better with the visual feature distribution. To further optimize prompt fine-tuning while maintaining cluster-specific discriminative power, we introduce a cluster-preserving regularization term. This ensures that prompts remain distinct and specialized for different clusters, preventing collapse into a uniform direction. Additionally, we integrate an adaptive prompt weighting technique to dynamically adjust the attention weights for flawed or ambiguous prompts, ensuring robust performance across diverse datasets and tasks.
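The contrast between feature-space and logits-space ensembling can be illustrated with a small NumPy sketch. This is a toy example, not the authors' implementation: the 2-D feature vectors, the function names, and the use of a maximum over per-prompt logits are all assumptions made for illustration, since the abstract does not state the exact reduction CAPEL uses.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def feature_average_logits(image_feats, prompt_feats):
    """Conventional ensembling: average the K prompt features per class first.

    image_feats: (N, D) unit vectors; prompt_feats: (Y, K, D) unit vectors.
    """
    class_protos = l2_normalize(prompt_feats.mean(axis=1))  # (Y, D)
    return image_feats @ class_protos.T                     # (N, Y)

def logits_ensemble(image_feats, prompt_feats):
    """Logits-space ensembling: score every prompt, then reduce over K.

    The max reduction is an illustrative assumption, not the paper's rule.
    """
    per_prompt = np.einsum("nd,ykd->nyk", image_feats, prompt_feats)  # (N, Y, K)
    return per_prompt.max(axis=-1)                                    # (N, Y)

# Hypothetical toy data: class 0 has two orthogonal sub-clusters,
# class 1 has two identical prompts at 30 degrees.
a = np.deg2rad(30.0)
prompt_feats = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[np.cos(a), np.sin(a)], [np.cos(a), np.sin(a)]],
])
image_feats = np.array([[1.0, 0.0]])  # lies exactly on class 0's first sub-cluster

pred_avg = int(feature_average_logits(image_feats, prompt_feats).argmax(axis=1)[0])
pred_logit = int(logits_ensemble(image_feats, prompt_feats).argmax(axis=1)[0])
print(pred_avg, pred_logit)
```

In this toy setup the averaged class-0 prototype sits at 45 degrees, between the two sub-clusters, so the image is closer to class 1's prototype and is misclassified. Scoring each prompt separately keeps the sub-cluster the image belongs to, which is the centroid-shift effect the abstract describes.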

Paper Structure

This paper contains 23 sections, 10 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Performance comparison with state-of-the-art few-shot adaptation methods in the 16-shot setting on 11 datasets. Our proposed CAPEL consistently achieves competitive performance.
  • Figure 2: Visualization of Visual Feature Clusters in the Oxford Pets Dataset. (a) CLIP visual features are semantically rich: This plot represents the visual feature clustering generated by the vision-language model CLIP. Since CLIP is pre-trained on diverse image-text pairs, it captures contextual information in the background as well as the primary object. For example, a small cluster contains water-related backgrounds associated with breeds like Saint Bernard, Leonberger, and Newfoundland, indicating that CLIP considers both the foreground (pet) and background context. (b) Vision Transformer (ViT) Encoder: This plot shows feature clustering using a ViT model pre-trained on image-label pairs. Unlike CLIP, ViT is more focused on the primary object itself, which results in clusters that are more distinctly separated by breed type without background influence.
  • Figure 3: Few-shot adaptation for VLMs. (a) Prompt Tuning [zhou2022learning]. (b) Adapter-based methods [gao2024clip]. (c) Prompt Ensembling [pratt2023does]. (d) Prompt Logits Ensembling.
  • Figure 4: Local visualization of feature spaces. Left: Supervised vision backbones produce tightly clustered features within classes, with class centroids (stars) located at the center of the clusters. Right: Vision-Language Models (VLMs) capture contextual information across categories, forming multiple sub-clusters for each class, often displacing the centroids (stars) outside the main clusters. This difference highlights the richer contextual embeddings of VLMs compared to traditional supervised models.
  • Figure 5: An illustration of Cluster-Aware Prompt Ensemble Learning for VLMs. We first leverage GPT-3 to generate $K$ prompts for each of the $Y$ training classes. The prompts are encoded as text features that initialize the classifiers. We combine the visual features and the classifiers to get $Y \times K$ logits. Through Cluster-Preserving Regularization (illustrated in a separate figure) and Adaptive Prompt Weighting, we obtain the prediction for a particular class.
  • ...and 6 more figures
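
The pipeline in the Figure 5 caption ($Y \times K$ logits, adaptive prompt weighting, and a conditional-entropy regularizer) can be sketched roughly as follows. This is one plausible reading only: this page does not give the paper's equations, so the softmax-normalized prompt scores, the weighted-sum reduction, the entropy over a sample's own-class prompts, and every name below are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_class_logits(per_prompt_logits, prompt_scores):
    """Collapse (N, Y, K) per-prompt logits into (N, Y) class logits.

    prompt_scores: (Y, K) learnable scores (hypothetical); a softmax over K
    turns them into weights so weak prompts can be down-weighted.
    """
    weights = softmax(prompt_scores, axis=-1)                # (Y, K)
    return (per_prompt_logits * weights[None]).sum(axis=-1)  # (N, Y)

def cluster_conditional_entropy(per_prompt_logits, labels, eps=1e-12):
    """Mean entropy of the cluster assignment within each sample's true class.

    Assumed form: low values mean each image commits to one prompt (cluster)
    of its class instead of spreading evenly over all K.
    """
    n = per_prompt_logits.shape[0]
    own_class = per_prompt_logits[np.arange(n), labels]      # (N, K)
    p = softmax(own_class, axis=-1)
    return float(-(p * np.log(p + eps)).sum(axis=-1).mean())

# Hypothetical sizes: N=4 images, Y=3 classes, K=5 prompts per class.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3, 5))
labels = np.array([0, 2, 1, 0])

class_logits = weighted_class_logits(logits, np.zeros((3, 5)))  # uniform weights
h = cluster_conditional_entropy(logits, labels)
print(class_logits.shape, h)
```

With all-zero scores the weights are uniform, so the class logits reduce to a plain mean over prompts. The entropy term lies between 0 and $\log K$ (up to the small `eps` offset) and approaches 0 when one prompt dominates for each image.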