Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mahmoud Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas

TL;DR

This work tackles the limitation of CLIP-style vision-language models that map images and captions to a single representation by introducing Latent Language Image Pretraining (Llip), which models caption diversity through caption-conditioned visual mixture tokens. By weighting multiple visual components via cross-attention driven by the target caption, Llip yields richer contextualized image representations and consistent zero-shot gains across classification and retrieval tasks. Comprehensive ablations demonstrate the value of caption-conditioned mixing over baseline SigLIP and CLIP approaches, with notable improvements on ImageNet zero-shot accuracy and MS-COCO retrieval. The approach scales with model size and maintains robustness across datasets and tasks, offering a simple yet effective enhancement to vision-language pretraining pipelines.

Abstract

There are a thousand ways to caption an image. Contrastive Language-Image Pretraining (CLIP), on the other hand, works by mapping an image and its caption to a single vector, which limits how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks, even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% across zero-shot classification benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet, outperforming a similarly sized CLIP by 1.4%. We also demonstrate an improvement of 6.0% on zero-shot retrieval on MS-COCO. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.
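The mixing step described in the abstract can be illustrated with a minimal sketch. It assumes single-head dot-product attention with no learned projections, where the caption embedding acts as the query and the visual mixture tokens act as keys and values. The function names, the `temperature` parameter, and the toy vectors are hypothetical and are not taken from the paper's implementation.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_visual_tokens(mixture_tokens, text_query, temperature=1.0):
    """Combine K visual mixture tokens into one vector, conditioned on a caption.

    Assumption: single-head attention without learned projections; the
    caption embedding is the query, the mixture tokens are keys and values.
    """
    d = len(text_query)
    scores = [dot(text_query, tok) / (temperature * math.sqrt(d))
              for tok in mixture_tokens]
    weights = softmax(scores)
    mixed = [sum(w * tok[i] for w, tok in zip(weights, mixture_tokens))
             for i in range(d)]
    return mixed, weights


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy example: one image with K=2 mixture tokens and two different captions.
tokens = [[1.0, 0.0], [0.0, 1.0]]
caption_a = [4.0, 0.0]
caption_b = [0.0, 4.0]

mixed_a, weights_a = mix_visual_tokens(tokens, caption_a)
mixed_b, weights_b = mix_visual_tokens(tokens, caption_b)

# The same image yields a different visual feature for each caption.
print(weights_a, weights_b)
print(cosine(mixed_a, caption_a), cosine(mixed_b, caption_a))
```

In this toy setup, each caption places most of its weight on the token it aligns with, so the same image produces two different final features. A single-vector model such as CLIP would have to settle on one compromise vector for both captions.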

Paper Structure (16 sections, 3 equations, 8 figures, 7 tables)

This paper contains 16 sections, 3 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: We propose Llip, Latent Language Image Pretraining, to model the diversity of matching captions for a given image. (a) Conceptual visualization of the CLIP (left) and Llip (right) architectures. CLIP independently encodes visual features (shown as circles) and text features (shown as squares), which are pulled closer together by maximizing the cosine similarity objective $\mathcal{L}$. The single image feature vector of CLIP has to compromise between all matching text features (illustrated in the feature manifold at the bottom of the figure). Llip outputs a set of visual mixture tokens, which are combined into a final visual feature vector conditioned on the context derived from the caption. Llip's visual representations can therefore represent each caption more accurately. (b) Zero-shot top-1 transfer accuracy averaged over 22 established classification benchmarks (see Section \ref{sec-zshclass}) against inference GFLOPs (estimated on the ImageNet zero-shot classification task) for encoders of various sizes. Llip outperforms the vision-language pretraining baselines. Llip was trained on the same data as MetaCLIP (metaclip).
  • Figure 2: Summary of the method Llip. (a) Schema of Llip's computation of the loss. An image encoder outputs $K$ mixture tokens ($K=2$ in the schema). The mixture tokens are given to a cross-attention module as keys and values, along with the text encoding, which is given as the query. The visual representation to be contrasted with the text target is conditioned on the text itself, allowing the model to produce a different visual representation depending on the caption. (b) Llip uses a contrastive objective and requires encoding the visual representation with the text targets to compute the loss.
  • Figure 3: Decomposing the effects of Llip's ingredients. Ablation of the added components of Llip compared to SigLIP and their effect on zero-shot ImageNet transfer accuracy. All models are trained with a ViT-B/32. From left to right, we evaluate: 1) the re-implemented SigLIP baseline, 2) adding $63$ additional mixture tokens (+Registers, darcet2023vision) that are not used in the final representation, 3) uniform mixing of the learnable tokens (+Average), 4) non-uniform mixing of the tokens (+Learned average), 5) context-conditional mixing of the tokens ($\text{Llip}_{64}$). Conditioning the mixing weights of the tokens on the text feature achieves the best performance.
  • Figure 4: ImageNet zero-shot transfer classification. We compare a ViT-G/14 trained with $\text{Llip}_{64}$ with various vision-language baselines. We select the best reported number for every method. Llip outperforms most of the vision-language pretraining baselines on ImageNet. DFN, the only method outperforming Llip, is trained on a larger dataset of 5B curated samples and uses an input image resolution of $378$ instead of $224$. We report the ImageNet performance of the baselines from: $^1$: cherti2023reproducible; $^2$: clip; $^3$: li2023clipav2; $^4$: sun2023evaclip; $^5$: siglip; $^6$: metaclip; $^7$: dfn.
  • Figure 5: Llip's representation is more expressive than the non-contextualized SigLIP baselines. Singular value spectrum of the covariance matrix of the visual features of a ViT-B/32 using different pre-training objectives. The embedding vectors are taken at the output of the visual encoder. The SigLIP with a learned query baseline adds $64$ mixture tokens and learns how to average them using cross-attention with a learnable query vector. We concatenate the $64$ mixture tokens along the batch dimension for the learned query baseline and Llip. Llip shows a slower decay in the singular value spectrum than the two baselines, which indicates a larger variability of the features.
  • ...and 3 more figures
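
The point made in the Figure 2(b) caption, that the visual representation must be encoded together with each text target before the loss can be computed, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the mixing is single-head attention without learned projections, the loss is a SigLIP-style pairwise sigmoid loss (chosen because the Figure 3 ablation starts from a SigLIP baseline), and the `scale` and `bias` defaults and all function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def conditioned_feature(tokens, query):
    """Softmax-weighted sum of one image's mixture tokens for one caption query."""
    scores = [dot(query, t) / math.sqrt(len(query)) for t in tokens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e / z * t[i] for e, t in zip(exps, tokens))
            for i in range(len(query))]


def similarity_matrix(batch_tokens, batch_text):
    """sims[i][j] = cosine(image i mixed under caption j, caption j).

    Every image is re-mixed once per caption, so an N x N batch needs
    N * N conditioned visual features rather than N.
    """
    sims = []
    for tokens in batch_tokens:
        row = []
        for text in batch_text:
            visual = normalize(conditioned_feature(tokens, text))
            row.append(dot(visual, normalize(text)))
        sims.append(row)
    return sims


def sigmoid_contrastive_loss(sims, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss: label +1 on the diagonal, -1 elsewhere.

    scale and bias are illustrative constants (assumption); in practice
    they would typically be learned.
    """
    n = len(sims)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = label * (scale * sims[i][j] + bias)
            total += math.log1p(math.exp(-logit))  # equals -log(sigmoid(logit))
    return total / n


# Toy batch: two images with K=2 mixture tokens each, and their two captions.
batch_tokens = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
batch_text = [[2.0, 0.0], [0.0, -2.0]]

sims = similarity_matrix(batch_tokens, batch_text)
loss_matched = sigmoid_contrastive_loss(sims)
# Swapping the columns pairs each image with the wrong caption.
loss_mismatched = sigmoid_contrastive_loss([row[::-1] for row in sims])
print(sims, loss_matched, loss_mismatched)
```

The design consequence is the extra cost of the double loop: because the visual feature depends on the caption, the similarity for every image-caption pair requires its own mixing step, whereas a non-contextualized model computes one visual feature per image and reuses it across all captions.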