CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

Yanqing Liu, Xianhang Li, Zeyu Wang, Bingchen Zhao, Cihang Xie

TL;DR

CLIPS tackles noisy web-crawled image-text data by leveraging richly descriptive synthetic captions through two complementary designs: using only short synthetic-caption fragments for contrastive learning and employing an autoregressive decoder to predict full synthetic captions in a recaptioning-like objective. The approach yields state-of-the-art zero-shot cross-modal retrieval on MSCOCO and Flickr30K and enhances MLLM benchmarks when deployed into LLaVA, demonstrating both higher effectiveness and efficiency. Key contributions include the discovery of an inverse-length effect for synthetic captions, a simple yet effective encoder strategy using a single short sentence, and a novel asymmetric generation objective that fully leverages synthetic captions. Together, these innovations advance vision-language pre-training with synthetic data and offer practical benefits for downstream multimodal systems.

Abstract

Previous works show that noisy, web-crawled image-text pairs may limit vision-language pretraining like CLIP and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions. First, we observe a strong inverse effect in learning with synthetic captions: short synthetic captions generally lead to much higher performance than full-length ones. We therefore feed only partial synthetic captions to the text encoder. Second, we incorporate an autoregressive captioner to mimic the recaptioning process: conditioned on the paired image input and the web-crawled text description, the captioner learns to predict the full-length synthetic caption generated by advanced MLLMs. Experiments show that our framework significantly improves zero-shot performance in cross-modal retrieval tasks, setting new SOTA results on MSCOCO and Flickr30K. Moreover, vision encoders trained this way can enhance the visual capability of LLaVA, showing strong improvements on a range of MLLM benchmarks. Our project page is https://ucsc-vlaa.github.io/CLIPS/.
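The first design, feeding only part of a synthetic caption to the text encoder, can be illustrated with a minimal sketch. It assumes that a "partial" caption means one or more consecutive sentences sampled from the full synthetic caption, and that sentences can be split on terminal punctuation. The function names `split_sentences` and `sample_sub_caption` and the `num_sentences` parameter are hypothetical and are not taken from the paper's code.

```python
import random
import re
from typing import List, Optional


def split_sentences(caption: str) -> List[str]:
    """Split a caption into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def sample_sub_caption(
    caption: str,
    num_sentences: int = 1,
    rng: Optional[random.Random] = None,
) -> str:
    """Return `num_sentences` consecutive sentences from a random position.

    Assumption: this is one plausible way to build the short text input for
    the contrastive branch; the paper's exact sampling rule may differ.
    """
    rng = rng or random.Random()
    sentences = split_sentences(caption)
    if len(sentences) <= num_sentences:
        # Caption is already short enough: use it unchanged.
        return " ".join(sentences)
    start = rng.randrange(len(sentences) - num_sentences + 1)
    return " ".join(sentences[start:start + num_sentences])


if __name__ == "__main__":
    synthetic = (
        "A dog runs on a beach. The sky is overcast. Waves break behind it."
    )
    print(sample_sub_caption(synthetic, rng=random.Random(0)))
```

In this sketch the full synthetic caption is left untouched, so it can still serve as the prediction target for the captioner described in the second design.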

Paper Structure

This paper contains 27 sections, 10 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: The pipeline of our proposed CLIPS. We introduce two simple yet effective designs to better leverage synthetic captions: 1) only a subpart of the synthetic caption is used in contrastive learning, and 2) a captioner predicts the full synthetic caption based on the web-crawled caption and the image. Our method registers new SOTA results on MSCOCO, achieving 76.4% in text retrieval and 57.2% in image retrieval.
  • Figure 2: Visualization of four different token reduction strategies. These strategies can improve the model's learning efficiency on synthetic captions to varying degrees. Among these strategies, the sub-caption and block mask perform best.
  • Figure 3: The inverse scaling effect of synthetic captions. Unlike the performance drop from reducing token length in original captions, shortening the token length of synthetic captions consistently improves model performance.
  • Figure 4: Ablation study on input and output token lengths. (a) pads a single sub-caption to different input lengths. (b) keeps the number of valid input tokens constant and varies the target output token length. Performance is measured by R@1 of I→T on MSCOCO.
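The figure captions above mention a block mask strategy (Figure 2) and padding a sub-caption to different input lengths (Figure 4a). The sketch below shows one possible reading of both on a list of token ids. It assumes that "block mask" means keeping a single contiguous block of tokens at a random offset, and that padding appends a pad id on the right; `block_mask`, `pad_to_length`, and `pad_id=0` are hypothetical names and values, not the paper's implementation.

```python
import random
from typing import List, Optional, Sequence


def block_mask(
    tokens: Sequence[int],
    keep: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Keep one contiguous block of `keep` tokens starting at a random offset.

    Assumption: this is an illustrative interpretation of "block mask".
    """
    rng = rng or random.Random()
    if keep <= 0:
        return []
    if len(tokens) <= keep:
        return list(tokens)
    start = rng.randrange(len(tokens) - keep + 1)
    return list(tokens[start:start + keep])


def pad_to_length(tokens: Sequence[int], length: int, pad_id: int = 0) -> List[int]:
    """Truncate to `length` tokens, then right-pad with `pad_id` if shorter."""
    clipped = list(tokens[:length])
    return clipped + [pad_id] * (length - len(clipped))


if __name__ == "__main__":
    token_ids = list(range(1, 11))  # stand-in for a tokenized caption
    block = block_mask(token_ids, keep=4, rng=random.Random(1))
    print(pad_to_length(block, length=8))
```

Padding changes only the input length seen by the text encoder, not the number of valid tokens, which is the distinction Figure 4 draws between its two panels.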