Exploring Weak-to-Strong Generalization for CLIP-based Classification

Jinhao Li, Sarah M. Erfani, Lei Feng, James Bailey, Feng Liu

TL;DR

This work investigates weak-to-strong generalization for CLIP-based classification in vision-language models, addressing alignment with user intent under limited human supervision. It introduces Class Prototype Learning (CPL), a method that learns per-class prototypes to represent robust class concepts using weak supervision, avoiding the need to fine-tune the text encoder. Across DomainNet and related benchmarks, CPL achieves an average accuracy of 64.74% and a notable 3.67% improvement over strong baselines, with substantial gains in challenging domains like Infograph and QuickDraw. The findings demonstrate that leveraging a weaker model to supervise a stronger one can effectively guide multimodal classifiers, offering an efficient pathway toward improved alignment when pretraining resources are constrained.

Abstract

Aligning large-scale commercial models with user intent is crucial to preventing harmful outputs. Current methods rely on human supervision but become impractical as model complexity increases. When models surpass human knowledge, providing accurate feedback becomes challenging and inefficient. A recently proposed solution is to use a weaker model to supervise a stronger one. This concept leverages the ability of weaker models to perform evaluations, thereby reducing the workload on human supervisors. Previous work has shown the effectiveness of weak-to-strong generalization for language-only models. Extending this concept to vision-language models builds on these insights and adapts the demonstrated benefits to a multi-modal context. In our study, we explore weak-to-strong generalization for CLIP-based classification. We propose a method, class prototype learning (CPL), which aims to enhance the classification capabilities of the CLIP model by learning more representative prototypes for each category. Our findings indicate that, despite using a simple loss function under weak supervision, CPL yields robust improvements in targeted scenarios, particularly when pretraining is limited. Extensive experiments demonstrate that our approach is effective under these settings, achieving a 3.67% improvement over strong baseline methods.

Paper Structure

This paper contains 27 sections, 6 equations, 2 figures, 7 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of the weak-to-strong process for enhancing strong model performance using weak model supervision. Unlabeled data from a given task is fed into both a strong model (CLIP) and a weak model. The strong model uses an image encoder to generate image features ($\bm r_i$), which are compared with learnable class prototypes ($\bm C_{1,:}, \bm C_{2,:}, \ldots, \bm C_{k,:}$) through cosine similarity to produce strong logits. Concurrently, the weak model generates weak logits from the same data. Our alignment loss ($L_{\text{CPL}}$ in Eq. \ref{eq:cpl}) is computed between the strong logits (based on the prototype matrix $\bm C$) and the weak logits. For test data, the image features ($\bm r'_i$) extracted from the strong model $f^\text{s}$ are compared with the learned prototype matrix $\bm C^\ast$ to make predictions, aiming to improve the strong model's classification performance in the given task.
  • Figure 2: Train (a) and test (b) accuracy over training steps for three methods: AuxConf+TP, AuxConf+LP, and Ours. "Ours" achieves the highest accuracy, approaching the ceiling performance (y = 0.7427) and surpassing the weak performance (y = 0.6715) in both the training and testing phases.
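
The pipeline in the Figure 1 caption (image features scored against learnable class prototypes by cosine similarity, then aligned with weak logits) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the exact form of $L_{\text{CPL}}$ is not given in this summary, so a plain cross-entropy against the weak model's softmax output is used as a stand-in, and the function names and the logit scale of 100 are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each vector to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def strong_logits(features, prototypes, scale=100.0):
    # features: (n, d) image features r_i from the strong model's image encoder.
    # prototypes: (k, d) learnable class prototype matrix C.
    # scale is an assumed temperature, not a value taken from the paper.
    return scale * l2_normalize(features) @ l2_normalize(prototypes).T

def alignment_loss(features, prototypes, weak_logits):
    # Stand-in for L_CPL (assumption): cross-entropy between the weak model's
    # soft labels and the strong model's prototype-based predictions.
    p_weak = softmax(weak_logits)
    log_p_strong = np.log(softmax(strong_logits(features, prototypes)) + 1e-12)
    return -(p_weak * log_p_strong).sum(axis=1).mean()

def predict(test_features, learned_prototypes):
    # Test time: assign each feature r'_i to its most similar prototype in C*.
    return strong_logits(test_features, learned_prototypes).argmax(axis=1)
```

In a training loop, only `prototypes` would be updated to reduce `alignment_loss`, which matches the summary's statement that the text encoder does not need fine-tuning; whether the image encoder stays frozen is not stated here.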