Parrot Captions Teach CLIP to Spot Text

Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou

TL;DR

CLIP's strong vision-language alignment is substantially biased by embedded visual text in large web datasets. The authors profile LAION-2B using OCR-based text spotting and CLIP-score analyses to quantify Co-Emb. text and Parrot Captions, finding that a large fraction of images contain visible text and captions frequently parrot that text. Training on parrot-caption–rich data increases text-reading capacity but harms zero-shot image-text generalization, revealing a trade-off between reading capability and semantic alignment. They propose text-oriented filtering as a simple fix and call for bias-aware data pipelines and perhaps new training objectives to mitigate text-spotting bias across CLIP-like models.

Abstract

Despite being the foundation model in numerous vision-language applications, CLIP suffers from a severe text spotting bias. This bias causes CLIP models to 'Parrot' the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset, LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content, and around 30% of caption words appear in this embedded visual content. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that the visual text is the dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.
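
To make the "caption words found in embedded visual text" statistic concrete, here is a minimal sketch of a per-pair word-overlap rate. It assumes the OCR output is already available as a list of strings; the function names (`tokenize`, `co_emb_text_rate`) and the lowercase alphanumeric normalization are illustrative assumptions, not the authors' implementation or exact definition.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def co_emb_text_rate(caption: str, spotted_words: list[str]) -> float:
    """Fraction of caption words that also appear in the OCR-spotted text.

    `spotted_words` stands in for the output of any text spotter; no OCR is run here.
    """
    caption_tokens = tokenize(caption)
    if not caption_tokens:
        return 0.0
    spotted = {tok for word in spotted_words for tok in tokenize(word)}
    hits = sum(1 for tok in caption_tokens if tok in spotted)
    return hits / len(caption_tokens)


# Toy example: five of the six caption words are also printed in the image.
rate = co_emb_text_rate(
    "Keep Calm and Carry On poster",
    ["KEEP", "CALM", "AND", "CARRY", "ON"],
)
print(f"{rate:.2f}")
```

Averaging such a rate over a dataset gives one possible reading of the "around 30% of caption words" figure, although the paper's precise matching rules may differ.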

Paper Structure (25 sections, 17 figures, 16 tables, 1 algorithm)

Figures (17)

  • Figure 1: In LAION-2B (Schuhmann et al., 2022), image-text pairs with the top-5% highest similarity scores are mostly dominated by visual text! These samples have dense concurrent text appearing in both captions and images (text in pixel form). We refer to their captions as Parrot Captions, as they raise a question: Does CLIP Simply Parrot Text in Images for Vision-Language Alignment? The concurrent text is spotted by the OCR model and highlighted with color in image-text pairs. (Best viewed in color.)
  • Figure 2: Visualization of defined terminologies. Co-emb. text is highlighted in the caption with colors.
  • Figure 3: (a): Based on the OCR prediction results, the image-text pairs are divided into three types: (1,1) the image has no embedded visual text content; (1,1) the text spotted in the image has no concurrent text with the caption; (1,1) the spotted text shares at least one concurrent word with the caption. The clusters are merged from 4000 into 100 for a better view. (b): In the clusters with a high (1,1) ratio, the top CLIP score samples contain various text sources, such as posters, book covers, advertisements, TV show screenshots, and even PowerPoint slides.
  • Figure 4: (a): Visualization of the number of caption words and the associated spotted concurrent words, based on precise word matching. (b): Distribution of the total area of concurrent words placed in the image and its ViT-B CLIP score. (c): Distribution of the text size of single concurrent words and other spotted words.
  • Figure 5: Left: Mean CLIP scores of image-text pairs under the different text removal operations described in Sec. \ref{subsec:ab_remove}; the data are grouped by cluster as in Fig. \ref{fig:overall_stat}. Right: Overall relative CLIP score distribution comparing the different text removal operations.
  • ...and 12 more figures
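
The three pair types named in the Figure 3 caption can be sketched as a small classifier over precomputed OCR results. This is a hedged illustration only: the label strings, the `pair_type` helper, and the tokenization are hypothetical choices, and the toy pairs are made up for the example rather than taken from LAION-2B.

```python
import re
from collections import Counter


def words(text: str) -> set[str]:
    """Set of lowercase alphanumeric tokens (assumed normalization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pair_type(caption: str, spotted_words: list[str]) -> str:
    """Assign an image-text pair to one of three types based on OCR output."""
    spotted = set()
    for word in spotted_words:
        spotted |= words(word)
    if not spotted:
        return "no_visual_text"        # nothing readable was spotted in the image
    if words(caption) & spotted:
        return "concurrent_text"       # at least one spotted word is in the caption
    return "text_without_overlap"      # text was spotted, but none is in the caption


# Toy (caption, spotted words) pairs, invented for illustration.
pairs = [
    ("a red car on the beach", []),
    ("city street at night", ["STOP"]),
    ("stop sign on a road", ["STOP"]),
]
counts = Counter(pair_type(caption, spotted) for caption, spotted in pairs)
print(counts)
```

Computing these counts per cluster would give the kind of type ratios the figure visualizes, under the stated assumptions.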