SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models

Zheng Liu, Hao Liang, Bozhou Li, Wentao Xiong, Chong Chen, Conghui He, Wentao Zhang, Bin Cui

TL;DR

SynthVLM tackles the data quality, effectiveness, and efficiency bottlenecks in vision-language model training by introducing a two-stage synthetic data pipeline: first curate high-quality captions and generate high-resolution images with diffusion models, then select the best image-caption pairs using CLIPScore and SSIM. The authors curate SynthVLM-100K, a high-quality synthetic dataset that enables pretraining of 7B and 13B multimodal models, achieving state-of-the-art results on VQA benchmarks and MMLU with only 18% of the data used by real-world baselines. Ablation studies demonstrate that both the generation and selection stages are essential for performance gains, and the data selection step yields substantial efficiency improvements without sacrificing accuracy. Overall, SynthVLM provides a scalable path to high-fidelity, precisely aligned multimodal data, with strong real-world transfer and preserved language abilities, advancing practical training of multimodal models.
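The selection stage mentioned above can be illustrated with a minimal sketch that ranks image-caption pairs by a CLIPScore-style similarity and keeps the top k. Everything here is an assumption for illustration: the embeddings are placeholder vectors rather than outputs of a real CLIP encoder, the function names (`clip_score`, `select_top_k`) are hypothetical, and the weight `w = 2.5` follows the commonly used CLIPScore definition rather than anything stated in this summary. How SSIM enters the authors' selection is not described here, so it is left out.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clip_score(image_emb: Sequence[float], text_emb: Sequence[float], w: float = 2.5) -> float:
    """CLIPScore-style alignment: w * max(cos(image, text), 0).

    Assumption: w = 2.5 is the usual rescaling constant; the embeddings would
    normally come from a CLIP image encoder and text encoder.
    """
    return w * max(cosine(image_emb, text_emb), 0.0)


def select_top_k(
    pairs: List[Tuple[str, Sequence[float], Sequence[float]]], k: int
) -> List[str]:
    """Return the ids of the k pairs with the highest score, best first.

    Each element of `pairs` is (pair_id, image_embedding, text_embedding).
    """
    scored = [(clip_score(img, txt), pid) for pid, img, txt in pairs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pid for _, pid in scored[:k]]


# Toy usage with 2-D placeholder embeddings (hypothetical data).
toy_pairs = [
    ("a", [1.0, 0.0], [1.0, 0.0]),  # perfectly aligned
    ("b", [1.0, 0.0], [0.0, 1.0]),  # orthogonal, score 0
    ("c", [1.0, 0.0], [1.0, 1.0]),  # partially aligned
]
selected = select_top_k(toy_pairs, k=2)
```

In practice the ranking would run over the full pool of generated pairs, with k chosen to reach the target dataset size.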

Abstract

Vision-Language Models (VLMs) have recently emerged, demonstrating remarkable vision-understanding capabilities. However, training these models requires large-scale datasets, which brings challenges related to the efficiency, effectiveness, and quality of web data. In this paper, we introduce SynthVLM, a new data synthesis and curation method for generating image-caption pairs. Unlike traditional methods, where captions are generated from images, SynthVLM utilizes advanced diffusion models and high-quality captions to synthesize and select images from text captions, thereby creating precisely aligned image-text pairs. We further introduce SynthVLM-100K, a high-quality dataset consisting of 100K curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets. Leveraging this dataset, we develop a new family of multimodal large language models (MLLMs), SynthVLM-7B and SynthVLM-13B, which achieve state-of-the-art (SOTA) performance on various visual question-answering (VQA) tasks. Notably, our models outperform LLaVA across most metrics with only 18% of the pretraining data. Furthermore, SynthVLM-7B and SynthVLM-13B attain SOTA performance on the MMLU benchmark, demonstrating that the high-quality SynthVLM-100K dataset preserves language abilities.

Paper Structure (39 sections, 9 equations, 9 figures, 10 tables)


Figures (9)

  • Figure 1: We compare SynthVLM-100K with LLaVA-558K. In (a), the generated images avoid content such as watermarks and advertisements. In (b), the generated images better reflect the content of the captions. Additionally, the generated images have a higher resolution than the real images.
  • Figure 2: The pipeline of the SynthVLM data synthesis method. First, we filter high-quality image-caption pairs. Next, we synthesize high-quality data and then filter it based on CLIPScore.
  • Figure 3: Our process and prompt design for match assessment using GPT-4V. We consider various aspects, including the quality of the image and the match between the image and the caption. Based on this process, we compare SynthVLM with existing datasets from the model's perspective.
  • Figure 4: In (a), the synthetic images avoid displaying real license plates and ticket information. In contrast, (b) contains actual license plates and ticket information, which can lead to privacy issues.
  • Figure 5: t-SNE visualizations of synthetic and real datasets for text and image modalities.
  • ...and 4 more figures