SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models
Zheng Liu, Hao Liang, Bozhou Li, Wentao Xiong, Chong Chen, Conghui He, Wentao Zhang, Bin Cui
TL;DR
SynthVLM addresses the quality, effectiveness, and efficiency bottlenecks of web data in vision-language model training with a two-stage synthetic data pipeline: first curate high-quality captions and generate high-resolution images from them with diffusion models, then select the best image-caption pairs using CLIPScore and SSIM. The resulting dataset, SynthVLM-100K, is used to pretrain 7B and 13B multimodal models, which achieve state-of-the-art results on VQA benchmarks and MMLU with only 18% of the pretraining data used by baselines trained on real-world data. Ablation studies show that both the generation and the selection stages are needed for the performance gains, and that the selection step substantially improves efficiency without sacrificing accuracy. Overall, SynthVLM offers a scalable way to produce precisely aligned multimodal data that transfers to real-world benchmarks while preserving language abilities.
Abstract
Vision-Language Models (VLMs) have recently emerged, demonstrating remarkable vision-understanding capabilities. However, training these models requires large-scale datasets, which brings challenges related to the efficiency, effectiveness, and quality of web data. In this paper, we introduce SynthVLM, a new data synthesis and curation method for generating image-caption pairs. Unlike traditional methods, where captions are generated from images, SynthVLM utilizes advanced diffusion models and high-quality captions to synthesize and select images from text captions, thereby creating precisely aligned image-text pairs. We further introduce SynthVLM-100K, a high-quality dataset consisting of 100K curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets. Leveraging this dataset, we develop a new family of multimodal large language models (MLLMs), SynthVLM-7B and SynthVLM-13B, which achieve state-of-the-art (SOTA) performance on various visual question answering (VQA) tasks. Notably, our models outperform LLaVA across most metrics with only 18% of the pretraining data. Furthermore, SynthVLM-7B and SynthVLM-13B attain SOTA performance on the MMLU benchmark, demonstrating that the high-quality SynthVLM-100K dataset preserves language abilities.
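As a rough illustration of the selection stage, the sketch below ranks synthesized image-caption pairs by a combined score and keeps the top k. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Pair` container, the precomputed embeddings and SSIM values, the weighted-sum combination, and the `alpha` weight are all hypothetical. Only the CLIPScore form, w * max(cos, 0) with w = 2.5, follows the standard definition of that metric.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Pair:
    caption: str
    image_emb: List[float]  # hypothetical: precomputed CLIP image embedding
    text_emb: List[float]   # hypothetical: precomputed CLIP text embedding
    ssim: float             # hypothetical: precomputed SSIM value in [0, 1]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(image_emb: List[float], text_emb: List[float], w: float = 2.5) -> float:
    """Standard CLIPScore form: w * max(cos(image, text), 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)


def select_top_k(pairs: List[Pair], k: int, alpha: float = 0.5) -> List[Pair]:
    """Keep the k pairs with the highest combined score.

    ASSUMPTION: the two metrics are combined as a weighted sum with weight
    alpha; the source only states that CLIPScore and SSIM are both used.
    """
    def combined(p: Pair) -> float:
        return alpha * clip_score(p.image_emb, p.text_emb) + (1 - alpha) * p.ssim

    return sorted(pairs, key=combined, reverse=True)[:k]
```

In practice the embeddings would come from a CLIP encoder and the SSIM values from an image-quality routine; both are treated as given inputs here so that the ranking logic stays self-contained.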
