CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, Arsene Fansi Tchango, Steven L. Waslander

TL;DR

DeBias-CLIP removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. It achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations.

Abstract

CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
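
The three caption-level augmentations named in the abstract (dropping the summary sentence, sub-sampling the remaining sentences, and padding the token sequence) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the naive regex sentence splitter, the choice to keep sampled sentences in their original order, and the random left-padding scheme are all assumptions made for illustration; the source only states that padding is used to spread supervision over later token positions.

```python
import random
import re


def split_sentences(caption):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p.strip() for p in parts if p.strip()]


def debias_caption(caption, rng, min_keep=1):
    """Drop the opening sentence, then keep a random subset of the rest.

    Keeping the sampled sentences in their original order is an assumption.
    """
    sentences = split_sentences(caption)
    if len(sentences) < 2:
        return caption.strip()  # nothing to drop or sample
    rest = sentences[1:]
    k = rng.randint(min(min_keep, len(rest)), len(rest))
    kept = sorted(rng.sample(range(len(rest)), k))
    return " ".join(rest[i] for i in kept)


def pad_tokens(token_ids, context_len, rng, pad_id=0):
    """Shift tokens right by a random number of pad tokens, then right-pad.

    Hypothetical padding scheme: the random offset moves real tokens onto
    later positions, so later positional embeddings see content in training.
    """
    token_ids = token_ids[:context_len]
    offset = rng.randint(0, context_len - len(token_ids))
    padded = [pad_id] * offset + token_ids
    return padded + [pad_id] * (context_len - len(padded))


rng = random.Random(0)
caption = ("A dog on a beach. The dog is brown. "
           "Waves break behind it. The sky is overcast.")
print(debias_caption(caption, rng))
print(pad_tokens([5, 6, 7], context_len=8, rng=rng))
```

In a real pipeline the integer list would come from the model's text tokenizer, and the pad positions would need to be handled by the text encoder's attention mask; both details are outside what the abstract specifies.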

Paper Structure

This paper contains 42 sections, 10 equations, 13 figures, 12 tables, 1 algorithm.

Figures (13)

  • Figure 1: Key issues in Long-CLIP and our proposed DeBias-CLIP. a) CLIP models fine-tuned on long captions, such as Long-CLIP (Zhang et al., 2024), are biased towards early tokens and expect a summary-like first sentence in captions to obtain good retrieval performance. On DOCCI text-to-image retrieval, swapping the first and fourth sentences of the long caption (Move) substantially degrades performance ($-9.7\%$), and removing the summary sentence (Remove) is even more detrimental ($-17.1\%$). b) Long-CLIP shows a steady decline in self-attention as a function of token depth, while our DeBias-CLIP has consistent token attention. c) Our DeBias-CLIP method resolves these issues with three caption-level text augmentations: starting from the original caption, we (i) remove the opening summary sentence, (ii) randomly sample from the remaining sentences, and (iii) pad the tokenized sequence to increase exposure of later positional embeddings during training.
  • Figure 2: Top-1 text-to-image retrieval on DOCCI as a function of the number of added padding sentences. One to five copies of the padding sentence 'This is a photo.' are prepended to the truncated original DOCCI caption (only its first two sentences are kept). We use the ViT-B/16 scale for all models.
  • Figure 3: Top-1 image-to-text retrieval on DOCCI with the first two sentences permuted. We analyze three setups: the first two sentences in the correct order (First 2), the same two sentences swapped (Swap 2), and the first sentence only (First only). Results are reported for four models: OpenAI CLIP, OpenCLIP (LAION-2B), SigLIP, and SigLIP2.
  • Figure 4: Top-1 text-to-image retrieval on DOCCI for different pretrained CLIP models. We consider three cases: Keep (full caption), Move (swap the first and fourth sentences), and Remove (drop the first sentence). DeBias-CLIP improves performance for all encoders.
  • Figure 5: Top-1 text-to-image retrieval on COCO (short) and DOCCI (long) as a function of short caption loss weight $\lambda^{s}$. Short retrieval peaks at $\lambda^{s}=0.1$, while long retrieval peaks at $\lambda^{s}=0.25$. Higher values degrade performance in both cases.
  • ...and 8 more figures
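
The caption perturbations described in the captions of Figures 1, 2, and 4 (Keep, Move, Remove, and prepending padding sentences) are simple string operations. The sketch below shows one way to build the perturbed captions. The function names and the regex sentence splitter are hypothetical, and the retrieval evaluation itself is not shown.

```python
import re


def split_sentences(caption):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p.strip() for p in parts if p.strip()]


def keep(caption):
    """Keep: the full caption, unchanged apart from whitespace normalization."""
    return " ".join(split_sentences(caption))


def move(caption, i=0, j=3):
    """Move: swap sentences i and j (first and fourth by default)."""
    sentences = split_sentences(caption)
    if max(i, j) < len(sentences):  # leave short captions untouched
        sentences[i], sentences[j] = sentences[j], sentences[i]
    return " ".join(sentences)


def remove(caption):
    """Remove: drop the first (summary) sentence when more than one exists."""
    sentences = split_sentences(caption)
    return " ".join(sentences[1:] if len(sentences) > 1 else sentences)


def prepend_padding(caption, n, keep_first=2, pad_sentence="This is a photo."):
    """Figure 2 setup: n padding sentences before the first `keep_first` sentences."""
    sentences = split_sentences(caption)[:keep_first]
    return " ".join([pad_sentence] * n + sentences)


caption = "S1. S2. S3. S4. S5."
print(move(caption))                # S4. S2. S3. S1. S5.
print(remove(caption))              # S2. S3. S4. S5.
print(prepend_padding(caption, 2))  # This is a photo. This is a photo. S1. S2.
```

Each perturbed caption would then be encoded by the text encoder and scored against the image embeddings in the usual way, so that the gap between Keep and the other variants measures how much a model relies on the opening sentence.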