Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

Zichao Li, Cihang Xie, Ekin Dogus Cubuk

TL;DR

This study analyzes how Contrastive Language-Image Pre-Training (CLIP) performs when scaled down to limited compute budgets, examining data quantity and quality, the choice between CNN and Vision Transformer (ViT) architectures, and four training strategies (SLIP, FLIP, CLIP, and CLIP+Data Augmentation). Using a large English WebLI subset, the authors show that a smaller, high-quality dataset can outperform a larger, lower-quality one, and that data quality becomes increasingly important as budgets shrink. The guidance is nuanced: CNNs can be preferable at very small data budgets, ViTs do better with more data, and CLIP+Data Augmentation can reach performance comparable to CLIP with only half the training data. These findings offer practical guidance for training affordable CLIP models with strong zero-shot, linear probing, and retrieval performance across downstream tasks and distributions.

Abstract

This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regard to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset of lower quality. We also examine how model performance varies with dataset size, suggesting that smaller ViT models are better suited for smaller datasets, while larger models perform better on larger datasets with fixed compute. Additionally, we provide guidance on when to choose a CNN-based architecture or a ViT-based architecture for CLIP training. We compare four CLIP training strategies (SLIP, FLIP, CLIP, and CLIP+Data Augmentation) and show that the choice of training strategy depends on the available compute resources. Our analysis reveals that CLIP+Data Augmentation can achieve performance comparable to CLIP using only half of the training data. This work provides practical insights into how to effectively train and deploy CLIP models, making them more accessible and affordable for practical use in various applications.

Paper Structure (16 sections, 27 figures)

Figures (27)

  • Figure 1: Training a High-Quality CLIP Model: This figure highlights the main contributions of our work. One panel (fig:all_flops) shows the relationship between different models, strategies, and error rates on the ImageNet dataset. The total computation is computed as GFLOPs per sample times the number of sampled data. Another panel (fig:da) illustrates how data augmentation methods improve zero-shot performance on various datasets.
  • Figure 2: Data Quantity: Zero-Shot Performances with the same dataset size across varying numbers of training epochs.
  • Figure 3: Data Quantity: Few-Shot Performances on ImageNet.
  • Figure 4: Data Quantity: Retrieval Performances on MSCOCO (Chen et al., 2015).
  • Figure 5: Data Quality: Zero-Shot Performances on ImageNet. One panel (fig:qua_1e_img) shows results trained for one epoch. Another panel (fig:qua_st_img) shows results trained for the same number of sampled data. We use ViT-B/32 as the vision encoder and ViT-B as the text encoder.
  • ...and 22 more figures
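
The total-computation definition in the Figure 1 caption (GFLOPs per sample times the number of sampled data) can be sketched as a small calculation. This is a minimal illustration, not the authors' code: the function names are hypothetical, the reading of "number of sampled data" as dataset size times epochs is an assumption, and every numeric value is a placeholder rather than a figure from the paper.

```python
def samples_seen(dataset_size: int, epochs: float) -> float:
    """Number of sampled data.

    Assumption (not stated in the source): examples in the dataset
    multiplied by the number of passes over it.
    """
    if dataset_size < 0 or epochs < 0:
        raise ValueError("dataset_size and epochs must be non-negative")
    return dataset_size * epochs


def total_compute_gflops(gflops_per_sample: float, num_sampled: float) -> float:
    """Total computation as defined in the Figure 1 caption:
    GFLOPs per sample x number of sampled data."""
    if gflops_per_sample < 0 or num_sampled < 0:
        raise ValueError("inputs must be non-negative")
    return gflops_per_sample * num_sampled


# Placeholder values for illustration only (not taken from the paper).
GFLOPS_PER_SAMPLE = 4.0

# Full dataset for 2 epochs versus half the dataset for 4 epochs.
full = total_compute_gflops(GFLOPS_PER_SAMPLE, samples_seen(1_000_000, 2))
half = total_compute_gflops(GFLOPS_PER_SAMPLE, samples_seen(500_000, 4))

print(full, half)
```

Under this assumption, halving the dataset while doubling the epochs leaves the total computation unchanged for a fixed model, which is one way to read runs described as trained for the same number of sampled data.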