Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Zhengfeng Lai; Vasileios Saveris; Chen Chen; Hong-You Chen; Haotian Zhang; Bowen Zhang; Juan Lao Tebar; Wenze Hu; Zhe Gan; Peter Grasch; Meng Cao; Yinfei Yang

TL;DR

This work proposes a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models, and reveals that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone.

Abstract

Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. Using caption formats ranging from Short Synthetic Captions (SSC) to Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.
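As an illustration of what a hybrid recipe that keeps both caption sources could look like in a data loader, here is a minimal sketch in Python. It is not the authors' implementation: the record type `ImageTextRecord`, the function `pick_caption`, and the `synthetic_ratio` parameter are hypothetical names introduced here, and per-sample random selection is only one assumed way to mix AltTexts with synthetic captions.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageTextRecord:
    """Hypothetical record: one image with its AltText and an optional synthetic caption."""
    image_id: str
    alt_text: str
    synthetic_caption: Optional[str] = None


def pick_caption(record: ImageTextRecord, synthetic_ratio: float, rng: random.Random) -> str:
    """Return the synthetic caption with probability `synthetic_ratio`, else the AltText.

    Assumption: mixing happens per sample at load time. Records without a
    synthetic caption always fall back to the AltText.
    """
    if not 0.0 <= synthetic_ratio <= 1.0:
        raise ValueError("synthetic_ratio must be in [0, 1]")
    if record.synthetic_caption is None:
        return record.alt_text
    # rng.random() is in [0, 1), so ratio 0.0 never picks the synthetic
    # caption and ratio 1.0 always does.
    if rng.random() < synthetic_ratio:
        return record.synthetic_caption
    return record.alt_text


if __name__ == "__main__":
    rng = random.Random(0)
    rec = ImageTextRecord(
        image_id="img-0",
        alt_text="golden retriever puppy for sale",
        synthetic_caption="A small golden puppy sits on green grass.",
    )
    # Draw a few captions at a 50/50 mix; both sources should appear over many draws.
    for _ in range(4):
        print(pick_caption(rec, synthetic_ratio=0.5, rng=rng))
```

Setting `synthetic_ratio` to 1.0 corresponds to training on synthetic captions alone, and 0.0 to AltTexts alone, so the same function can express the baselines being compared against the hybrid setting.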

Paper Structure (17 sections, 1 equation, 10 figures, 10 tables)

Introduction
Related Work
Customized Re-captioning for Multimodal Foundation Models
MLLM as An Image Describer
Two-Stage Human-Aligned Captioning
Caption Analysis
Image-Caption Data for Multimodal Foundation Models
Image-Caption Data for CLIP
Image-Caption Data for Multimodal LLM
Image-Caption Data for Diffusion Model
Discussion
Experimental Details
CLIP
Additional Experiments
Multimodal LLM
...and 2 more sections

Figures (10)

Figure 1: The role of image-text data in multimodal foundation models: a key component in training CLIP and Diffusion Model, and essential for multimodal LLM (MLLM) pre-training alongside text and interleaved image-text data. We propose a controllable captioning pipeline to synthesize different types of captions and explore optimal image-text data recipes for training these foundation models.
Figure 2: Zero-shot retrieval and classification performance of CLIP models. (a) The effect of synthetic captions (LLaVA-recaptioned) and AltText: using only LLaVA captions improves retrieval but significantly degrades zero-shot classification performance. (b) The effect of different synthetic caption formats on CLIP: Short Synthetic Captions (SSC) outperform Descriptive Synthetic Captions (DSC), and combining the two achieves the best results.
Figure 3: Examples of controllable captions of diverse formats generated by our captioner: we can generate from brief to dense descriptions and fuse AltText into the caption (AFC).
Figure 4: Directly using MLLMs as image captioners may produce hallucinations and captions that do not follow the given instructions: both LLaVA and ShareGPT4V generate more than three sentences and contain obvious hallucinations.
Figure 5: Overview of the controllable and human-aligned captioning pipeline. In Stage 1, we convert a pre-trained MLLM into a customized captioner that strictly follows the captioning instructions. In Stage 2, we leverage human-aligned captions to further fine-tune the captioner.
...and 5 more figures
