CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Oncel Tuzel

TL;DR

This work investigates why CLIP pretraining underperforms on dense vision tasks and identifies caption quality as a key lever. By pretraining CLIP on datasets with higher-quality, better-aligned captions (DataComp and DataCompDR), the authors demonstrate substantial improvements in semantic segmentation and depth estimation, with competitive or superior results compared to MAE and MAWS at scale. The findings extend to mobile architectures, where CLIP pretraining yields strong accuracy-efficiency trade-offs. Overall, the study emphasizes the critical role of data quality in foundation-model pretraining and provides practical guidance for achieving strong dense-prediction performance.

Abstract

CLIP models perform remarkably well on zero-shot classification and retrieval tasks. However, recent studies have shown that the representations learnt by CLIP are not well suited to dense prediction tasks such as object detection, semantic segmentation, or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvements on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good-quality captions can surpass recent supervised, self-supervised, and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks, respectively, than recent state-of-the-art Masked Image Modeling (MIM) pretraining methods such as Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains performance similar to Swin-L pretrained on ImageNet-22k on the semantic segmentation task while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ better data efficiency when finetuning for dense prediction tasks.

Paper Structure

This paper contains 18 sections, 1 equation, 3 figures, 15 tables.

Figures (3)

  • Figure 1: Qualitative examples of captions in DataComp and DataCompDR datasets.
  • Figure 2: Data scaling trends for CLIP pretraining on DataComp and DataCompDR. All results are for a ViT-B/16 model. Improved caption quality results in better data efficiency for learning transferable representations.
  • Figure 3: Average attention distances of a ViT-B/16 model trained on 3 different datasets with varying caption quality. The caption quality improves from left to right. There is a noticeable improvement in the diversity of attention distances when CLIP models are trained on datasets with better captions.
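
The "average attention distance" plotted in Figure 3 is not defined on this page. The sketch below assumes the definition commonly used in the ViT literature: for one attention head, the distance in pixels between each query patch and every key patch, weighted by the attention weight and averaged over query patches. The function name, its arguments, and the exclusion of the class token are illustrative assumptions, not the authors' code.

```python
import math


def mean_attention_distance(attn, grid_size, patch_size=16):
    """Attention-weighted mean distance (in pixels) for one head.

    attn: square list of lists, attn[q][k] is the attention weight from
          query patch q to key patch k (each row sums to 1).
          Assumption: the class token has already been removed.
    grid_size: patches per image side (e.g. 14 for a 224px image, 16px patches).
    patch_size: patch side in pixels (16 for ViT-B/16).
    """
    n = grid_size * grid_size
    if len(attn) != n or any(len(row) != n for row in attn):
        raise ValueError("attn must have shape (grid_size^2, grid_size^2)")

    # (row, col) grid coordinates of every patch.
    coords = [(i // grid_size, i % grid_size) for i in range(n)]

    total = 0.0
    for q, (qr, qc) in enumerate(coords):
        for k, (kr, kc) in enumerate(coords):
            dist = math.hypot(qr - kr, qc - kc) * patch_size
            total += attn[q][k] * dist
    # Average over query patches.
    return total / n


# Toy 2x2 grid: a head that attends only to itself vs. a uniform head.
identity = [[1.0 if q == k else 0.0 for k in range(4)] for q in range(4)]
uniform = [[0.25] * 4 for _ in range(4)]

local_distance = mean_attention_distance(identity, grid_size=2)
global_distance = mean_attention_distance(uniform, grid_size=2)
```

Under this definition, a head with a small value attends locally and a head with a large value attends globally. A wider spread of values across the heads of a layer is one way to read the "diversity of attention distances" that the caption describes.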