CLIP Adaptation by Intra-modal Overlap Reduction

Alexey Kravets, Vinay Namboodiri

TL;DR

This work analyses the intra-modal overlap of embeddings in CLIP's image space and demonstrates that reducing this overlap improves performance on a number of standard datasets and increases robustness to distribution shift.

Abstract

Numerous methods have been proposed to adapt a pre-trained foundational CLIP model for few-shot classification. As CLIP is trained on a large corpus, it generalises well through adaptation to few-shot classification. In this work, we analyse the intra-modal overlap in image space in terms of embedding representation. Our analysis shows that, due to contrastive learning, embeddings from the CLIP model exhibit a high overlap between the cosine similarity distributions of paired and unpaired examples in the image space. This overlap affects the performance of few-shot training-free classification methods, which rely on similarity in the image space for their predictions. To tackle intra-modal overlap, we propose to train a lightweight adapter on a generic set of samples from the Google Open Images dataset, and we demonstrate that this improves accuracy for few-shot training-free classification. We validate our contribution through extensive empirical analysis and demonstrate that reducing the intra-modal overlap leads to a) improved performance on a number of standard datasets, b) increased robustness to distribution shift, and c) higher feature variance, rendering the features more discriminative for downstream tasks.
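
The overlap discussed in the abstract can be illustrated with a small sketch. It assumes that "paired" means two images of the same class and "unpaired" means two images of different classes, and that the overlap is the intersection area of the two cosine similarity histograms, as the Figure 3 caption suggests. The function names, the bin count, and the use of NumPy are illustrative choices and not the authors' implementation.

```python
import numpy as np


def cosine_similarity_matrix(features):
    # L2-normalise each row so that the dot product equals cosine similarity.
    # Assumes no row is the zero vector.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Clip to guard against values such as 1.0000000002 from rounding.
    return np.clip(normed @ normed.T, -1.0, 1.0)


def intra_modal_overlap(features, labels, bins=100):
    """Intersection area between the cosine similarity histograms of
    paired (same-class) and unpaired (different-class) image embeddings.

    Hypothetical helper: 0 means the two distributions are disjoint,
    1 means they coincide.
    """
    labels = np.asarray(labels)
    sims = cosine_similarity_matrix(np.asarray(features, dtype=float))

    # Upper triangle without the diagonal: each unordered pair once,
    # and no image compared with itself.
    iu = np.triu_indices(len(labels), k=1)
    pair_sims = sims[iu]
    same_class = (labels[:, None] == labels[None, :])[iu]

    # Shared bin edges so the two histograms are directly comparable.
    edges = np.linspace(-1.0, 1.0, bins + 1)
    paired, _ = np.histogram(pair_sims[same_class], bins=edges, density=True)
    unpaired, _ = np.histogram(pair_sims[~same_class], bins=edges, density=True)

    # Each density integrates to 1, so the integral of the pointwise
    # minimum lies between 0 and 1.
    bin_width = edges[1] - edges[0]
    return float(np.minimum(paired, unpaired).sum() * bin_width)
```

Applied to embeddings from the original and from an adapted image encoder on the same labelled images, a lower value for the adapted encoder would correspond to the reduction in intra-modal overlap that the paper reports. The result depends on the bin count, so comparisons should use the same `bins` value.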

Paper Structure

This paper contains 42 sections, 16 equations, 17 figures, and 12 tables.

Figures (17)

  • Figure 1: Overview of the process. First, we perform an intra-modal overlap correction step on the CLIP image encoder through adaptation. Then, this new image encoder is used to create an intra-modal-overlap-corrected cache model that can be used in any training-free method, improving its performance.
  • Figure 3: Intra-modal overlap, measured as the intersection area between the cosine similarity distributions of paired and unpaired images, using the adapted and original CLIP image encoders (lower is better).
  • Figure 4: Relation between IMO reduction and the average performance difference between TA++ and TA on fine-grained datasets.
  • Figure 5: Variance of features on the ImageNet validation set for the original and adapted visual encoders.
  • Figure 6: t-SNE visualization of randomly chosen classes from the ImageNet validation dataset using original (left) and adapted (right) visual features.
  • ...and 12 more figures