Learning Visual Composition through Improved Semantic Guidance

Austin Stone, Hagen Soltau, Robert Geirhos, Xi Yi, Ye Xia, Bingyi Cao, Kaifeng Chen, Abhijit Ogale, Jonathon Shlens

TL;DR

The paper tackles the limited ability of visual–semantic embeddings to capture image composition by advocating a data-centric, scalable approach within CLIP. It introduces grounded recaptioning and the use of strong pretrained text encoders to enrich semantic targets, achieving substantial gains on compositional benchmarks such as ARO and DOCCI without architectural changes. Results show strong improvements in image retrieval and zero-shot tasks, including high DOCCI recall and competitive zero-shot ImageNet performance, while revealing COCO’s limitations for assessing compositional understanding. The findings illustrate that improving the quality of captions and text guidance can unlock significant gains in multimodal representations, with practical impact on retrieval, grounding, and potential extensions to captioning systems.

Abstract

Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning, where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation by developing bespoke learned architectures that directly target the shortcomings in compositional learning. In this work, we focus on simple and scalable approaches. In particular, we demonstrate that by substantially improving weakly labeled data, i.e., captions, we can vastly improve the performance of standard contrastive learning approaches. Previous CLIP models achieved near chance rate on challenging tasks probing compositional learning. However, our simple approach boosts performance of CLIP substantially and surpasses all bespoke architectures. Furthermore, we showcase our results on a relatively new captioning benchmark derived from DOCCI. We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may achieve impressive performance on image retrieval tasks.
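To make the "bag of words" limitation concrete, the following minimal sketch shows that a representation which discards word order cannot tell apart two captions describing different compositions. The caption pair and the `bag_of_words` helper are hypothetical illustrations and are not taken from the paper.

```python
from collections import Counter


def bag_of_words(caption):
    # Lowercase, split on whitespace, and count tokens.
    # Word order (and therefore who does what to whom) is discarded.
    return Counter(caption.lower().split())


# Hypothetical caption pair: same words, different relation between objects.
original = "the horse is eating the grass"
swapped = "the grass is eating the horse"

# The strings differ, but their bag-of-words representations are identical.
print(original == swapped)                               # False
print(bag_of_words(original) == bag_of_words(swapped))   # True
```

A model whose text side behaves like `bag_of_words` would assign both captions the same score for any image, which is why it would perform near chance on tests that swap relations or attributes.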

Paper Structure

This paper contains 19 sections, 1 equation, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Summary of results. Left: Previous state-of-the-art multimodal embeddings (Radford et al., 2021; Jia et al., 2021) have limited understanding of the composition of images. Right: The goal of this work is to learn multimodal embeddings which reflect a strong understanding of the composition of visual and semantic information. Images and captions are red and blue, respectively.
  • Figure 2: Summary of methodology. All images were recaptioned using a multimodal foundation model grounded on the image and the alt-text for the image on the web page. In this case, we show captions generated using Gemini 1.5 Flash (Gemini Team, 2024). We highlight aspects of the new caption in blue that leverage the alt-text or demonstrate a capability. Note how the generated captions leverage information provided by the alt-text or perform OCR on the original image to improve the caption. Captions are for images from CC-12M (Changpinyo et al., 2021).
  • Figure 3: Synthetic negative captions. We prompt a foundation model (Gemini Team, 2024) to generate 64 million synthetic positive and negative annotations. To generate the negative prompts, we provide few-shot examples matching the style of the ARO relations and attributes evaluations. Captions are for images from CC-12M (Changpinyo et al., 2021).
  • Figure 4: Recaptioning increases caption length by 8×. While the median alt-text caption length is just 7 words, the detailed captions generated by Gemini 1.5 Flash increase this to 57 words.
  • Figure 5: Recaptioning improves caption log-likelihood. Alt-text on the web is often unnatural (example: "bigtimerush nyc 007"), leading to low log-likelihood with a median of -223. In contrast, the captions from Gemini 1.5 Flash substantially improve median log-likelihood to -83, indicating that these captions are much closer to natural-language sentences than alt-text.
  • ...and 2 more figures
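
Figure 3 mentions synthetic negative captions, but this summary does not state how they enter training. One common way to use hard negatives in a CLIP-style contrastive objective, assumed here purely for illustration, is to append each image's negative caption as one extra candidate in the image-to-text softmax. The function name, the temperature value, and the one-negative-per-image layout below are all assumptions, and only the image-to-text direction is shown.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(logits):
    # Numerically stable log-softmax over each row.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def image_to_text_loss(image_emb, text_emb, neg_text_emb=None, temperature=0.07):
    """Image-to-text contrastive loss (assumed setup, not the paper's code).

    image_emb, text_emb: arrays of shape (B, D); row i of each is a matched pair.
    neg_text_emb: optional (B, D) array, one hard negative caption per image.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    # (B, B) similarity matrix; the diagonal holds the matched pairs.
    logits = image_emb @ text_emb.T / temperature
    if neg_text_emb is not None:
        neg_text_emb = l2_normalize(neg_text_emb)
        # Similarity of each image to its own hard negative, shape (B, 1).
        neg_logits = np.sum(image_emb * neg_text_emb, axis=1, keepdims=True) / temperature
        # The hard negative competes with the in-batch captions: (B, B + 1).
        logits = np.concatenate([logits, neg_logits], axis=1)
    idx = np.arange(image_emb.shape[0])
    # Cross-entropy with the matched caption (column i for image i) as target.
    return -log_softmax(logits)[idx, idx].mean()


rng = np.random.default_rng(0)
images = rng.normal(size=(4, 8))
captions = rng.normal(size=(4, 8))
negatives = rng.normal(size=(4, 8))

print(image_to_text_loss(images, captions))
print(image_to_text_loss(images, captions, negatives))
```

Under this assumed setup, a negative caption that the text encoder maps close to the positive caption (as a bag-of-words-like encoder would for a word-swapped caption) keeps the loss at or above log 2, so the gradient pushes the encoders to separate the two.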