Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models

Zhihe Lu, Jiawang Bai, Xin Li, Zeyu Xiao, Xinchao Wang

TL;DR

This work addresses open-world generalization for vision-language models by ensembling pre-trained CLIP variants. It introduces three strategies, each tailored to one scenario: a zero-shot ensemble (ZS$_{En}$) that fuses logits with confidence-aware weights, a training-free ensemble (TF$_{En}$) that finds fusion weights by greedy search on few-shot samples, and a tuning ensemble (T$_{En}$) that generates sample-aware weights with a sample-aware weight generator (SWIG). Across 11 datasets and multiple generalization settings, the ensembles achieve state-of-the-art performance, with notable gains from incorporating weaker models and from confidence-aware weighting. The findings indicate that ensembling can go beyond what algorithmic improvements to a single model deliver, offering a practical route to robust generalization in vision-language systems.

Abstract

Fine-tuning pre-trained vision-language models (VLMs), e.g., CLIP, for open-world generalization has gained increasing popularity due to its practical value. However, performance gains are limited when relying solely on intricate algorithmic designs for a single model, even one exhibiting strong performance, e.g., CLIP-ViT-B/16. This paper, for the first time, explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model. The affirmative findings motivate us to address the generalization problem from a novel perspective, i.e., an ensemble of pre-trained VLMs. We introduce three customized ensemble strategies, each tailored to one specific scenario. First, we introduce the zero-shot ensemble, which automatically adjusts the logits of different models based on their confidence when only pre-trained VLMs are available. Furthermore, for scenarios with extra few-shot samples, we propose the training-free and tuning ensembles, offering flexibility based on the availability of computing resources. The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance. Notably, this work represents an initial stride toward enhancing the generalization performance of VLMs via ensembling. The code is available at https://github.com/zhiheLu/Ensemble_VLM.git.
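
The zero-shot ensemble described above can be illustrated with a small sketch. It is not the authors' implementation: the abstract only says that logits are adjusted based on model confidence, so the confidence measure used here (the maximum softmax probability of each model's logits) is an assumption, as are the function names `softmax` and `zero_shot_ensemble` and the toy logits. The strongest model is given a fixed weight of 1.0, and every other model is weighted by its assumed confidence.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_ensemble(logits_per_model, anchor=0):
    """Fuse the logits of several models for a single sample.

    The anchor (strongest) model gets weight 1.0. Every other model is
    weighted by its maximum softmax probability. That confidence measure
    is an ASSUMPTION made for illustration only.
    """
    num_classes = len(logits_per_model[0])
    fused = [0.0] * num_classes
    weights = []
    for i, logits in enumerate(logits_per_model):
        w = 1.0 if i == anchor else max(softmax(logits))
        weights.append(w)
        for c in range(num_classes):
            fused[c] += w * logits[c]
    return fused, weights


# Hypothetical logits for one image over three classes.
logits = [
    [2.0, 1.0, 0.5],  # strongest model (anchor)
    [0.1, 3.0, 0.2],  # weaker model, confident about class 1
    [1.0, 1.0, 1.0],  # weaker model, maximally uncertain
]
fused, weights = zero_shot_ensemble(logits)
prediction = max(range(len(fused)), key=fused.__getitem__)
print(weights, prediction)
```

In this toy case the confident weaker model receives a weight close to 0.9 and overturns the anchor's prediction, while the uncertain model receives a weight of 1/3 and contributes equally to every class, so it does not change the ranking.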

Paper Structure (33 sections, 4 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Comparison with existing methods on base-to-new generalization. The results indicate that the proposed method outperforms prior art on 11 diverse datasets, often by large margins.
  • Figure 2: Zero-shot evaluation of pre-trained CLIP vision encoders on varying datasets. The bar charts show that "weak" models may perform better than strong ones, e.g., RN50 vs. RN101 in (c, d) and RN101 vs. ViT-B/32 in (e, f), which encourages us to leverage diverse models for a stronger ensemble.
  • Figure 3: Illustration of our zero-shot ensemble (ZS$_{En}$). We assign a weight of 1.0 to the best-performing model, i.e., CLIP-ViT-B/16, and use confidence-aware weights for the other VLMs.
  • Figure 4: Illustration of our training-free ensemble (TF$_{En}$). We assign a weight of 1.0 to the best-performing model, i.e., CLIP-ViT-B/16, and determine the weights of the other VLMs by greedy search on a given "training" set, without any training.
  • Figure 5: Illustration of our tuning ensemble (T$_{En}$). The proposed sample-aware weight generator (SWIG) takes sample features as input to generate sample-aware weights, which are then used for weighted prediction.
  • ...and 1 more figure
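
The training-free ensemble in the Figure 4 caption (a fixed weight of 1.0 for the best model, greedy search for the remaining weights on a few-shot set) might look roughly like the sketch below. The candidate grid, the one-model-at-a-time search order, the accuracy criterion, and the names `accuracy` and `greedy_weight_search` are all assumptions chosen for illustration; the caption does not specify them.

```python
def accuracy(logits_per_model, labels, weights):
    """Accuracy of the weighted-logit ensemble on a labelled set.

    logits_per_model[m][n] holds the class logits of model m for sample n.
    """
    correct = 0
    for n, label in enumerate(labels):
        num_classes = len(logits_per_model[0][n])
        fused = [
            sum(w * model[n][c] for w, model in zip(weights, logits_per_model))
            for c in range(num_classes)
        ]
        if max(range(num_classes), key=fused.__getitem__) == label:
            correct += 1
    return correct / len(labels)


def greedy_weight_search(logits_per_model, labels, grid, anchor=0):
    """Greedy search over a candidate grid, one non-anchor model at a time.

    The anchor model keeps weight 1.0. A candidate weight is accepted only
    if it strictly improves accuracy (an ASSUMED acceptance rule).
    """
    weights = [0.0] * len(logits_per_model)
    weights[anchor] = 1.0
    best = accuracy(logits_per_model, labels, weights)
    for i in range(len(weights)):
        if i == anchor:
            continue
        for candidate in grid:
            trial = list(weights)
            trial[i] = candidate
            score = accuracy(logits_per_model, labels, trial)
            if score > best:
                best, weights = score, trial
    return weights, best


# Hypothetical few-shot "training" set: 2 models, 3 samples, 2 classes.
strong = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.5]]  # wrong on the third sample
weak = [[0.5, 0.0], [0.0, 0.5], [0.0, 2.0]]    # right on the third sample
labels = [0, 1, 1]

weights, best = greedy_weight_search([strong, weak], labels, grid=[0.0, 0.5, 1.0])
print(weights, best)
```

No gradient step is involved: the search only re-evaluates cached logits, which is what makes this kind of procedure cheap when compute is limited.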