What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi

TL;DR

This study uncovers the mechanisms behind CLIP's generalizability beyond data imbalance, and enables models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks.

Abstract

Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. With an aim to investigate the reasons behind this finding, we conduct controlled experiments to study various underlying factors, and reveal that CLIP's pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP's generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. Code and data are available at: https://github.com/CVMI-Lab/clip-beyond-tail.
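
The "dynamic classification problem wherein only a subset of classes is present in training" can be illustrated for a supervised classifier with a softmax restricted to a sampled subset of classes. The following is a minimal sketch, not the authors' implementation: the function name `subsampled_cross_entropy`, the `vocab_size` parameter, and the uniform sampling of negative classes are all assumptions made for illustration. See the linked repository for the actual code.

```python
import math
import random


def subsampled_cross_entropy(logits, target, vocab_size, rng):
    """Cross-entropy over a random subset of classes that always contains the target.

    logits:     per-class scores for one sample (one entry per class).
    target:     index of the ground-truth class.
    vocab_size: number of classes taking part in the softmax (hypothetical knob).
    rng:        a random.Random instance, passed in for reproducibility.
    Returns (loss, sampled_class_indices).
    """
    num_classes = len(logits)
    if not 1 <= vocab_size <= num_classes:
        raise ValueError("vocab_size must be between 1 and the number of classes")

    # Assumption: negatives are drawn uniformly; the ground truth is always kept.
    negatives = [c for c in range(num_classes) if c != target]
    sampled = [target] + rng.sample(negatives, vocab_size - 1)

    # Numerically stable log-sum-exp over the sampled classes only.
    max_logit = max(logits[c] for c in sampled)
    log_z = max_logit + math.log(sum(math.exp(logits[c] - max_logit) for c in sampled))

    # Classes outside `sampled` contribute nothing to this sample's loss,
    # so a dominant class only competes when it happens to be drawn.
    return log_z - logits[target], sampled


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [2.0, 0.5, -1.0, 0.0]
    loss, classes = subsampled_cross_entropy(scores, target=1, vocab_size=2, rng=rng)
    print(f"classes in softmax: {sorted(classes)}, loss: {loss:.4f}")
```

With `vocab_size` equal to the number of classes, this reduces to the standard cross-entropy; smaller values shrink the set of classes that each sample is contrasted against.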

Paper Structure (46 sections, 1 equation, 21 figures, 2 tables)

This paper contains 46 sections, 1 equation, 21 figures, 2 tables.

Figures (21)

  • Figure 1: Per-class statistics of image-text datasets and models trained on top. (a) A highly imbalanced class distribution is shared across datasets. (b) Compared to supervised learning (✖ SL), CLIP's performance (measured by ∙ accuracy) is less biased by data frequency, and the classifier is notably uncorrelated (measured by the model's number of ∙ predictions per class). Besides, the correlation narrows as data scales up. Both aspects indicate that implicit re-balancing mechanisms exist in CLIP.
  • Figure 2: Curation process and distribution of datasets used in our controlled study. Top: IN-Caps [fang2022incaps] augments train images of ImageNet with texts by querying Flickr with image URLs. The texts include title, description, and tags. Bottom: LAIONet [shirali2023laionet] is a filtered subset of LAION-400M [schuhmann2021laion400m], obtained by matching ImageNet classes with captions and filtering by CLIP text encoder for disambiguation.
  • Figure 3: Results on IN-Caps about ∙ text descriptiveness and ✖ vocabulary size. 1) Increasing ∙ text descriptiveness improves both robustness (a) and discriminability (b) of CLIP, but the tendency varies if using ∙ less descriptive (template-based) supervision. 2) The gap between SL and CLIP (a) implies CLIP re-balances predictions, which is replicable by ✖ subsampling the vocabulary SL trains with.
  • Figure 4: Results on LAIONet about data distribution (level of data imbalance, distribution shift, and data diversity). 1) Extreme data imbalance makes models more prone to bias (last column vs. others). 2) Distribution shift (∙ vs. ∎, last column) harms discriminability but could improve robustness if a pre-trained text head is used. 3) Higher data diversity (smaller threshold) also improves robustness.
  • Figure 5: Results on LAIONet subsets about data scale and text encoder. 1) CLIP's discriminability (a) and robustness (b) co-improve as data scales up, and can be boosted by pre-trained heads. 2) A frozen head helps CLIP preserve intra-class variation (c) while not harming margins (d), which can be lost if fine-tuned. It is also unattainable by SL even using the same head. 3) Language pre-training using CLIP is more favorable for image-text tasks than pure language modeling (e.g., RoBERTa [liu2019roberta]).
  • ...and 16 more figures
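
The Figure 1 comparison relates per-class accuracy and per-class prediction counts to class frequency in the training data. A minimal sketch of that kind of measurement is given below. It assumes a Pearson correlation against log-frequency, which may differ from the exact statistic used in the paper, and the helper names `per_class_stats` and `pearson` are hypothetical.

```python
import math


def per_class_stats(labels, preds, num_classes):
    """Return (per-class accuracy, number of predictions per class)."""
    correct = [0] * num_classes
    total = [0] * num_classes
    predicted = [0] * num_classes
    for y, p in zip(labels, preds):
        total[y] += 1
        predicted[p] += 1
        if y == p:
            correct[y] += 1
    # Classes with no evaluation samples get accuracy 0.0 by convention here.
    accuracy = [c / t if t else 0.0 for c, t in zip(correct, total)]
    return accuracy, predicted


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy data (made up for illustration): class 0 is the head class.
    train_freq = [1000, 100, 10]
    labels = [0, 0, 1, 1, 2, 2]
    preds = [0, 0, 0, 1, 0, 2]
    accuracy, predicted = per_class_stats(labels, preds, num_classes=3)
    log_freq = [math.log10(f) for f in train_freq]
    print("corr(log freq, accuracy):   ", round(pearson(log_freq, accuracy), 3))
    print("corr(log freq, predictions):", round(pearson(log_freq, predicted), 3))
```

A correlation near zero for the prediction counts would correspond to the "notably uncorrelated" classifier described in the caption, while a value near one indicates predictions that follow the training frequency.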