Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilic, Jovana Mitrovic, Charles Blundell, Andrea Banino

TL;DR

Synth$^2$ tackles the data bottleneck in visual-language modeling by generating fully synthetic caption–embedding pairs through a controlled pipeline that uses an LLM for captions and an image-embedding generator trained on the same data as the VLM. Training operates in embedding space to avoid costly pixel-space rendering while preserving performance, and a fair evaluation is ensured by pre-training both components on the same dataset. Empirical results show that synthetic data can match or exceed performance achieved with human-labeled data, with significant gains in efficiency and data utilization, and analyses highlight semantic diversity and balanced caption coverage as key factors. Overall, Synth$^2$ demonstrates a promising path toward scalable, self-improving multimodal models that rely less on expensive labeled data.

Abstract

The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). In this work, we investigate an approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Although the text-to-image model and the VLM are initially trained on the same data, our approach leverages the image generator's ability to create novel compositions, resulting in synthetic image embeddings that expand beyond the limitations of the original dataset. Extensive experiments demonstrate that our VLM, finetuned on synthetic data, achieves performance comparable to models trained solely on human-annotated data, while requiring significantly less data. Furthermore, we perform a set of analyses on captions, which reveal that semantic diversity and balance are key aspects for better downstream performance. Finally, we show that synthesizing images in the image embedding space is 25\% faster than in the pixel space. We believe our work not only addresses a significant challenge in VLM training but also opens up promising avenues for the development of self-improving multi-modal models.
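
The data flow described in the abstract (an LLM produces captions, and a text-to-image model maps each caption directly to an image embedding) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than the authors' implementation: `sample_captions` replaces the LLM with fixed templates, `caption_to_embedding` replaces the image generator with a caption-seeded random vector, and the `topics` argument and embedding size are assumptions made only for illustration.

```python
import random
from typing import List, Tuple

Embedding = List[float]


def sample_captions(topics: List[str], n_per_topic: int, rng: random.Random) -> List[str]:
    """Hypothetical stand-in for the LLM: fill simple templates with a topic."""
    templates = [
        "a photo of a {} on a table",
        "a {} in a park at sunset",
        "a close-up of a {} next to a window",
    ]
    return [
        rng.choice(templates).format(topic)
        for topic in topics
        for _ in range(n_per_topic)
    ]


def caption_to_embedding(caption: str, dim: int = 8) -> Embedding:
    """Hypothetical stand-in for the text-to-image model.

    It returns a vector in "embedding space" directly, with no pixel image
    rendered and re-encoded. The vector is pseudo-random and seeded by the
    caption, so the same caption always maps to the same embedding.
    """
    rng = random.Random(caption)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def build_synthetic_pairs(
    topics: List[str], n_per_topic: int, seed: int = 0, dim: int = 8
) -> List[Tuple[str, Embedding]]:
    """Produce (caption, image-embedding) pairs that a VLM could be trained on."""
    rng = random.Random(seed)
    captions = sample_captions(topics, n_per_topic, rng)
    return [(caption, caption_to_embedding(caption, dim)) for caption in captions]


if __name__ == "__main__":
    for caption, embedding in build_synthetic_pairs(["dog", "bicycle"], n_per_topic=2):
        print(caption, [round(x, 2) for x in embedding[:3]])
```

The point of the sketch is only the shape of the data: each training example is a caption paired with an embedding, so no pixel-space decoding or re-encoding step appears anywhere in the loop.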

Paper Structure (30 sections, 2 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: (A) Traditional dataset curation pipelines require a human in the loop to collect and annotate images. (B) We study whether we can reverse this pipeline with generative models, i.e. by first sampling synthetic captions from an LLM and then synthetically generating images from those. (C) By operating in the image embedding space, we also propose to bypass computationally expensive encoder/decoder steps, optimizing and integrating the process within VLM training.
  • Figure 2: Examples of synthetic captions and synthetic images generated by the LLM and the text-to-image generator.
  • Figure 3: We introduce a VLM framework that leverages LLMs and image generation models to create synthetic image-text pairs for efficient training. The VLM can be trained on both non-synthetic (A) and synthetic (B) data, as shown in this figure. Our model trained with the added synthetic pairs demonstrates impressive image captioning performance, significantly reducing the need for human-annotated images.
  • Figure 4: Semantic diversity. The histogram represents the distribution of cluster sizes, with GenPair showing a more uniform coverage of semantic concepts. See \ref{sec:app-cluster} for details on how the histogram was derived.
  • Figure 5: Performance as a function of training steps. The blue curve shows the baseline trained solely on paired data (CCv2). The purple curve shows Synth$^2$, which is additionally trained on fully synthetic data (GenPair). Synth$^2$ achieves parity with the baseline using roughly 1/3 of the training steps, showcasing its superior efficiency. Shaded regions represent the standard deviation across 3 random seeds.
  • ...and 1 more figures
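
The cluster-size histogram mentioned in the Figure 4 caption can be sketched as follows, assuming each caption has already been assigned a cluster id by some clustering step (not shown, and not specified in this summary). The normalized-entropy score is an assumption added here as one convenient way to summarize how uniform the coverage is; it is not stated to be the authors' metric.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable, List


def cluster_size_histogram(cluster_ids: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Count how many captions fall into each cluster."""
    return dict(Counter(cluster_ids))


def normalized_entropy(sizes: List[int]) -> float:
    """Entropy of the cluster-size distribution, scaled to [0, 1].

    1.0 means all clusters have the same size (balanced coverage);
    values near 0 mean a few clusters dominate. With a single cluster
    the score is defined as 0.0 here to avoid dividing by log(1) = 0.
    """
    k = len(sizes)
    total = sum(sizes)
    if k <= 1 or total == 0:
        return 0.0
    probs = [s / total for s in sizes if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(k)


if __name__ == "__main__":
    balanced = [0, 0, 1, 1, 2, 2]   # hypothetical cluster ids, even coverage
    skewed = [0, 0, 0, 0, 1, 2]     # hypothetical cluster ids, one dominant cluster
    for name, ids in [("balanced", balanced), ("skewed", skewed)]:
        hist = cluster_size_histogram(ids)
        print(name, hist, round(normalized_entropy(list(hist.values())), 3))
```

Under this score, a caption set with more uniform cluster sizes ranks higher than one dominated by a few clusters, which matches the qualitative reading of the histogram in the caption.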