What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

Xin Wen; Bingchen Zhao; Yilun Chen; Jiangmiao Pang; Xiaojuan Qi

TL;DR

This study uncovers the mechanisms behind CLIP's generalizability beyond data imbalance, and enables models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks.

Abstract

Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find that CLIP pre-trained on such data exhibits notable robustness to the imbalance compared to supervised learning, and is significantly effective at learning generalizable representations. To investigate the reasons behind this finding, we conduct controlled experiments on various underlying factors, and reveal that CLIP's pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP's generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. Code and data are available at: https://github.com/CVMI-Lab/clip-beyond-tail.
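The "dynamic classification" idea in the abstract can be illustrated with a small sketch: a cross-entropy loss computed over a random subset of classes that always contains the ground-truth class. This is not the authors' implementation (see their repository for that). The function name `subsampled_cross_entropy`, the `vocab_size` parameter, and the uniform sampling of negative classes are all assumptions made for illustration.

```python
import math
import random


def subsampled_cross_entropy(logits, target, vocab_size, rng):
    """Cross-entropy over a random class subset that always contains the target.

    logits: per-class scores for one sample (length = number of classes).
    target: index of the ground-truth class.
    vocab_size: number of classes kept in the softmax (hypothetical parameter).
    rng: a random.Random instance used to draw the negative classes.
    """
    num_classes = len(logits)
    if not 1 <= vocab_size <= num_classes:
        raise ValueError("vocab_size must be between 1 and the number of classes")
    # Assumption: negatives are drawn uniformly without replacement.
    negatives = [c for c in range(num_classes) if c != target]
    kept = [target] + rng.sample(negatives, vocab_size - 1)
    # Numerically stable log-sum-exp over the kept classes only.
    max_logit = max(logits[c] for c in kept)
    log_norm = max_logit + math.log(sum(math.exp(logits[c] - max_logit) for c in kept))
    return log_norm - logits[target], kept


rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.5]
full_loss, _ = subsampled_cross_entropy(logits, target=1, vocab_size=len(logits), rng=rng)
sub_loss, kept = subsampled_cross_entropy(logits, target=1, vocab_size=3, rng=rng)
print(f"full-vocabulary loss: {full_loss:.3f}")
print(f"subsampled loss: {sub_loss:.3f} over classes {sorted(kept)}")
```

With a full vocabulary, every class competes as a negative at every step, so frequently seen classes push against the others constantly. With a subsampled vocabulary, each class acts as a negative only in the steps where it is drawn, which is one way to read the abstract's claim that the bias from dominant classes is isolated.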

Paper Structure (46 sections, 1 equation, 21 figures, 2 tables)

Introduction
Related work
What makes CLIP more robust to long-tailed pre-training data?
Setting
(Descriptive) language as supervision signal
Dynamic classification (using subsampled vocabulary) as pretext task
Data distribution (level of imbalance, web distribution shift, and intra-class diversity)
Data scaling (also achievable via language pre-training)
Utilization of open-world concepts
Understanding the feature distribution of CLIP pre-trained at scale
Acquiring CLIP-level generalization
Data-imbalanced learning: an extreme case
Empowering self-supervised learning in-the-wild at scale
Limitations, future work, and broader impacts
Concluding remarks
...and 31 more sections

Figures (21)

Figure 1: Per-class statistics of image-text datasets and models trained on top. (a) A highly imbalanced class distribution is shared across datasets. (b) Compared to supervised learning (✖ SL), CLIP's performance (measured by ● accuracy) is less biased by data frequency, and the classifier is notably uncorrelated with it (measured by the model's number of ● predictions per class). Besides, the correlation narrows as data scales up. Both aspects indicate that implicit re-balancing mechanisms exist in CLIP.
Figure 2: Curation process and distribution of datasets used in our controlled study. Top: IN-Caps [fang2022incaps] augments the train images of ImageNet with texts by querying Flickr with image URLs. The texts include title, description, and tags. Bottom: LAIONet [shirali2023laionet] is a filtered subset of LAION-400M [schuhmann2021laion400m], obtained by matching ImageNet classes with captions and filtering with the CLIP text encoder for disambiguation.
Figure 3: Results on IN-Caps about ● text descriptiveness and ✖ vocabulary size. 1) Increasing ● text descriptiveness improves both robustness (a) and discriminability (b) of CLIP, but the tendency varies if using ● less descriptive (template-based) supervision. 2) The gap between SL and CLIP (a) implies CLIP re-balances predictions, which is replicable by ✖ subsampling the vocabulary SL trains with.
Figure 4: Results on LAIONet about data distribution (level of data imbalance, distribution shift, and data diversity). 1) Extreme data imbalance makes models more prone to bias (last column vs. others). 2) Distribution shift (●● vs. ■■, last column) harms discriminability but could improve robustness if a pre-trained text head is used. 3) Higher data diversity (smaller threshold) also improves robustness.
Figure 5: Results on LAIONet subsets about data scale and text encoder. 1) CLIP's discriminability (a) and robustness (b) co-improve as data scales up, and can be boosted by pre-trained heads. 2) A frozen head helps CLIP preserve intra-class variation (c) while not harming margins (d), which can be lost if fine-tuned. It is also unattainable by SL even using the same head. 3) Language pre-training using CLIP is more favorable for image-text tasks than pure language modeling (e.g., RoBERTa [liu2019roberta]).
...and 16 more figures
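
The Figure 1 caption describes bias as a correlation between class frequency and two per-class statistics: accuracy and number of predictions. A minimal sketch of such a measurement follows. The exact statistic used in the paper is not stated here, so the Pearson correlation against log-frequency, the function names `pearson` and `frequency_bias`, and the toy data are assumptions for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def frequency_bias(class_counts, labels, predictions):
    """Correlate log training frequency with per-class accuracy and prediction counts.

    Assumes every class has a positive training count and appears at least
    once in `labels`, and that neither statistic is constant across classes.
    """
    num_classes = len(class_counts)
    correct = [0] * num_classes
    total = [0] * num_classes
    predicted = [0] * num_classes
    for y, p in zip(labels, predictions):
        total[y] += 1
        predicted[p] += 1
        correct[y] += int(y == p)
    log_freq = [math.log(c) for c in class_counts]
    accuracy = [correct[c] / total[c] for c in range(num_classes)]
    return pearson(log_freq, accuracy), pearson(log_freq, predicted)


# Toy example: three classes with a long-tailed training distribution and a
# balanced evaluation set on which the model over-predicts the head class.
class_counts = [1000, 100, 10]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
predictions = [0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 0]
acc_corr, pred_corr = frequency_bias(class_counts, labels, predictions)
print(f"frequency vs. accuracy correlation: {acc_corr:.2f}")
print(f"frequency vs. prediction-count correlation: {pred_corr:.2f}")
```

Under this reading, values near zero for both correlations would correspond to a model whose accuracy and predictions are largely independent of how often each class appeared in training.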
