Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling

Cristian Rodriguez-Opazo, Ehsan Abbasnejad, Damien Teney, Hamed Damirchi, Edison Marrese-Taylor, Anton van den Hengel

TL;DR

This paper investigates the diverse representations and robustness of CLIP backbones trained on the same data and objective, revealing substantial complementarity across architectures. It introduces the Neural Logit Controller (NLC), an adaptive ensemble that learns per-backbone temperatures to weight logits conditioned on the input, using only a small labeled holdout set. Across 21 datasets, NLC improves accuracy by up to 39.1% over the best single backbone, with an average gain of around 9%, while remaining compatible with efficiency frameworks such as Cascade. The results show that combining diverse backbones with input-aware weighting yields substantial accuracy gains, that the computational load can be reduced by selecting a subset of backbones, and that the approach complements existing few-shot adapters.
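To make the input-conditioned weighting concrete, here is a minimal NumPy sketch of a forward pass in the spirit of NLC. It is an illustration under stated assumptions, not the authors' implementation: the controller input (concatenated per-backbone image features), the single hidden layer, the softplus used to keep temperatures positive, and all names (`nlc_forward`, `params`) are hypothetical. Training on the labeled holdout is omitted, so the weights here are random.

```python
import numpy as np

def softplus(z):
    # Stable softplus; used here (as an assumption) to keep temperatures positive.
    return np.logaddexp(0.0, z)

def nlc_forward(features, logits, params):
    """Combine per-backbone logits with input-dependent temperatures.

    features: list of B arrays, each (N, d_b), image features per backbone
    logits:   array (B, N, C), zero-shot logits per backbone
    params:   dict with the weights of a small MLP (hypothetical layout)
    """
    x = np.concatenate(features, axis=1)                   # (N, sum of d_b)
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])   # hidden layer, ReLU
    temps = softplus(h @ params["W2"] + params["b2"])      # (N, B), one per backbone
    # Weighted sum over backbones: combined[n, c] = sum_b temps[n, b] * logits[b, n, c]
    combined = np.einsum("nb,bnc->nc", temps, logits)
    return combined, temps

# Toy setup: 3 backbones, 5 images, 4 classes, 8-dim features per backbone.
rng = np.random.default_rng(0)
B, N, C, d, hidden = 3, 5, 4, 8, 16
features = [rng.normal(size=(N, d)) for _ in range(B)]
logits = rng.normal(size=(B, N, C))
params = {
    "W1": rng.normal(scale=0.1, size=(B * d, hidden)),
    "b1": np.zeros(hidden),
    "W2": rng.normal(scale=0.1, size=(hidden, B)),
    "b2": np.zeros(B),
}

combined, temps = nlc_forward(features, logits, params)
predictions = combined.argmax(axis=1)  # ensemble prediction per image
```

In a trained version, the MLP weights would be fit on the small labeled holdout so that the temperatures favor whichever backbones are most reliable for a given input.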

Abstract

Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. Various architectures, from vision transformers (ViTs) to convolutional networks (ResNets), have been trained with CLIP to serve as general solutions to diverse vision tasks. This paper explores the differences across various CLIP-trained vision backbones. Despite using the same data and training objective, we find that these architectures have notably different representations, different classification performance across datasets, and different robustness properties to certain types of image perturbations. Our findings indicate a remarkable possible synergy across backbones by leveraging their respective strengths. In principle, classification accuracy could be improved by over 40 percentage points with an informed selection of the optimal backbone per test example. Using this insight, we develop a straightforward yet powerful approach to adaptively ensemble multiple backbones. The approach uses as few as one labeled example per class to tune the adaptive combination of backbones. On a large collection of datasets, the method achieves a remarkable increase in accuracy of up to 39.1% over the best single backbone, well beyond traditional ensembles.
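The "informed selection of the optimal backbone per test example" can be read as an oracle upper bound: an example counts as correct if at least one backbone classifies it correctly. The sketch below computes that bound on made-up toy predictions; the function name `oracle_accuracy` and the data are illustrative assumptions, not values from the paper.

```python
import numpy as np

def oracle_accuracy(preds, labels):
    """Upper bound from picking, per example, any backbone that is correct.

    preds:  array (B, N) of predicted class ids, one row per backbone
    labels: array (N,) of ground-truth class ids
    """
    correct = preds == labels[None, :]   # (B, N) boolean correctness matrix
    return correct.any(axis=0).mean()    # correct if at least one backbone is right

# Toy predictions from three hypothetical backbones on six examples.
preds = np.array([
    [0, 1, 2, 2, 0, 1],
    [0, 2, 2, 1, 0, 0],
    [1, 1, 0, 1, 0, 2],
])
labels = np.array([0, 1, 2, 1, 1, 2])

per_backbone = (preds == labels[None, :]).mean(axis=1)  # accuracy of each backbone
best_single = per_backbone.max()
oracle = oracle_accuracy(preds, labels)
```

In this toy case each backbone is right on only half of the examples, yet the oracle reaches 5/6, because the backbones' errors fall on different examples. That gap is the kind of complementarity the adaptive ensemble tries to exploit.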

Paper Structure (35 sections, 32 figures, 24 tables)

Figures (32)

  • Figure 1: We propose a method to improve CLIP's effectiveness for image classification by combining the strengths of different backbones. (Left) For a given test image, the logits from different backbones are combined with a temperature scaling that weights their contribution to the final prediction. The scaling is implemented in the Neural Logit Controller (NLC, a small MLP), which is learned from as few as one labeled example per class. (Right) To reduce the computational load, our method can be combined with the Cascade framework (Wang et al., 2022).
  • Figure 2: Relative improvement of NLC over the best backbone (Y axis) vs. prediction diversity (X axis). NLC always improves over the best backbone. Moreover, the clear correlation shows that higher diversity in predictions tends to result in greater improvements with NLC.
  • Figure 4: Zero-shot classification accuracy of various CLIP models on 21 datasets, and of the upper-bound "Oracle" combination of ResNets (RN), ViTs, and all backbones.
  • Figure 5: Linear Venn diagrams showing the overlap of test images from ImageNet-1k correctly classified by different backbones (rows). Each column represents a subset of images correctly classified by a specific group of backbones (group size in column header). Row/column sums indicate the number of correct predictions per backbone/subset. We observe that (1) different backbones agree on a large part of the data; (2) they also make additional correct predictions on different subsets; and (3) accuracy usually grows with architecture size, but even within the same family (ViTs, ResNets), different models show different patterns of (in)correct predictions.
  • Figure 6: Diversity of predictions of (1) different backbones with the same pre-training dataset, (2) the same backbone with different pre-training datasets, and (3) the same backbone and the same pre-training dataset at two different epochs. Results show that complementarity is highest when we combine different backbones.
  • ...and 27 more figures