Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

Xinsong Zhang, Yarong Zeng, Xinting Huang, Hu Hu, Runquan Xie, Han Hu, Zhanhui Kang

TL;DR

The paper tackles the data bottleneck in vision-language model pre-training by proposing a scalable pipeline for generating low-hallucination, knowledge-rich captions. It introduces Continuous DPO (CDPO) to suppress hallucinations and a knowledge-enriching SFT step to inject external knowledge, culminating in the Hunyuan-Recap100M dataset. Pre-training VLMs on this data yields consistent improvements across 15 VL tasks and 20 cognitive domains, with notable reductions in FID on both real-world and MSCOCO benchmarks, and enhanced perceptual capabilities. This approach offers a practical path to data-efficient multimodal learning and provides a substantial open-resource dataset, while acknowledging high compute costs and certain risk considerations.

Abstract

In recent years, vision-language model pre-training has advanced rapidly, driven primarily by the continuous improvement of the textual capabilities of large language models. However, existing training paradigms for multimodal large language models rely heavily on high-quality image-text pairs. As model and data scales grow exponentially, such meticulously curated data has become increasingly scarce, severely limiting further progress in this domain. This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates empirically that large-scale low-hallucination synthetic captions can serve two purposes: 1) acting as a viable alternative to real-world data for pre-training and 2) delivering superior performance when integrated into vision-language models. This paper presents the following key contributions: 1) A novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions. Our continuous DPO methodology substantially reduces hallucinations: the non-hallucination caption rate on a held-out test set increases from 48.3% to 77.9% for a 7B-size model. 2) Comprehensive empirical validation showing that our synthetic captions confer superior pre-training advantages over their counterparts. Across 15 vision-language tasks, the model trained with our data achieves a performance gain of at least 6.2% over a model trained on the identical images paired with alt-text. In 20 common cognitive domains, the model trained with our data outperforms the one trained on alt-text data by at least 7.5%. Our data also offers considerable support in the text-to-image domain: with our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and by 13.3 on the MSCOCO validation benchmark.
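
The continuous DPO step mentioned in the abstract builds on Direct Preference Optimization. As a minimal sketch, assuming the standard DPO objective rather than the paper's continuous variant (whose details are not given in this abstract), the loss for a single preference pair could be computed as below. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions, as is the idea that the less hallucinated caption plays the role of the preferred ("chosen") response.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full caption under
    either the policy being trained or the frozen reference model.
    """
    # How much more (or less) likely each caption is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed toy log-probabilities, not values from the paper.
loss_neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)
loss_better = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(loss_neutral, loss_better)
```

When the policy has not yet moved away from the reference model, the margin is zero and the loss equals log(2). The loss falls as the policy assigns relatively more probability to the chosen caption than to the rejected one.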

Paper Structure

This paper contains 15 sections, 2 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Examples of the original caption, previous synthetic caption and our synthetic caption.
  • Figure 2: A GPT-4o Prompt for Knowledge-Rich Captions. The portion highlighted in red within the prompt emphasizes the instructions for knowledge injection. This prompt utilizes a three-step procedure, which is intended to help ensure the inclusion of a rich body of world knowledge in the generated caption.
  • Figure 3: The illustration of our recaptioning pipeline. The pipeline comprises the following key stages: Initial Data Generation with GPT-4o, Manual Review, Knowledge-Enriching SFT, and Continuous DPO.
  • Figure 4: Performance of different DPO strategies. The horizontal axis represents the training data scale, while the vertical axis depicts the proportion of hallucination-free captions on the validation set (in percentage). Initially, the CDPO strategy utilizes the same training data as DPO. Upon reaching a training scale of $218k$ data points, it incorporates an additional $139k$ new data instances.
  • Figure 5: Loss curves of vision-language models pre-trained with alt-text, Recap-DataComp-1B, Qwen2-VL-Zeroshot, or Hunyuan-Recap100M. To ensure a fair comparison, all pre-trained models analyzed here used the same amount of data ($20M$) for pre-training. The four convergence curves shown are obtained by fine-tuning these models on an identical supervised dataset.
  • ...and 1 more figure
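
The vertical axis of Figure 4 is the proportion of hallucination-free captions. The sketch below shows one way to compute that metric, assuming each caption has already been judged and given a boolean label; the judging procedure is not described in this section, and the function names are hypothetical. It also works out the absolute and relative change implied by the 48.3% and 77.9% rates reported in the abstract.

```python
def hallucination_free_rate(labels):
    """Percentage of captions judged hallucination-free.

    `labels` is an iterable of booleans, True meaning that no
    hallucination was found in the corresponding caption.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("labels must be non-empty")
    return 100.0 * sum(1 for ok in labels if ok) / len(labels)


def rate_change(before_pct, after_pct):
    """Return (absolute change in points, relative change in percent)."""
    absolute = after_pct - before_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


# Toy labels for illustration only, not data from the paper.
toy_rate = hallucination_free_rate([True, False, True, True])

# Rates reported in the abstract for the 7B-size model.
absolute, relative = rate_change(48.3, 77.9)
print(f"{absolute:.1f} points absolute, {relative:.1f}% relative")
```

Reporting both numbers avoids ambiguity: the same improvement is 29.6 percentage points in absolute terms and roughly 61% relative to the starting rate.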