Should VLMs be Pre-trained with Image Data?

Sedrick Keh, Jean Mercat, Samir Yitzhak Gadre, Kushal Arora, Igor Vasiljevic, Benjamin Burchfiel, Shuran Song, Russ Tedrake, Thomas Kollar, Ludwig Schmidt, Achal Dave

TL;DR

This work challenges the conventional two-stage paradigm for vision-language model (VLM) training, in which images are added only after text pre-training is complete, by integrating image data into the pre-training phase itself. Through a large-scale study of 300 models across scales, data compositions, and pre-training progress, it shows that introducing visual information during a cooldown roughly 80% of the way through text pre-training improves vision-language benchmark results while preserving text performance, with an optimal visual-token fraction of around 10–20% at 1B parameters. It also finds that pre-training with image data from scratch can harm both vision and text tasks, and that including instruction-tuning data in pre-training degrades vision-language performance but can help text tasks. Fine-tuning needs only a modest number of epochs (2–4) to balance vision and text outcomes. These results provide practical guidance for designing efficient, open-source VLM training pipelines and suggest that carefully timed, partially integrated pre-training can outperform the traditional fully sequential approach.

Abstract

Pre-trained LLMs that are further trained with image data perform well on vision-language tasks. While adding images during a second training phase effectively unlocks this capability, it is unclear how much of a gain or loss this two-step pipeline gives over VLMs that integrate images earlier into the training process. To investigate this, we train models spanning various datasets, scales, image-text ratios, and amounts of pre-training done before introducing vision tokens. We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. We find that pre-training with a mixture of image and text data allows models to perform better on vision-language tasks while maintaining strong performance on text-only evaluations. Averaged over 6 diverse tasks, we find that for a 1B model, introducing visual tokens 80% of the way through pre-training results in a 2% average improvement over introducing visual tokens to a fully pre-trained model.
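The timing described above can be made concrete with a small sketch. This is not the authors' training code; it is a simplified illustration that assumes a single token budget with a hard switch from text-only data to an image-text mixture. The function names (`image_token_fraction`, `phase_token_budgets`) and the dictionary keys are hypothetical. The default switch point of 0.8 comes from the abstract, and the default image ratio of 0.1 is taken from the 10–20% range quoted in the TL;DR.

```python
def _switch_point(total_tokens: int, text_only_fraction: float) -> int:
    """Token index at which image data is first introduced (hypothetical helper)."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 <= text_only_fraction <= 1.0:
        raise ValueError("text_only_fraction must lie in [0, 1]")
    return round(text_only_fraction * total_tokens)


def image_token_fraction(
    tokens_seen: int,
    total_tokens: int,
    text_only_fraction: float = 0.8,  # assumption: switch 80% of the way through
    image_ratio: float = 0.1,         # assumption: 10% of mixed-phase tokens are visual
) -> float:
    """Fraction of image tokens to sample at the current point in training."""
    if not 0.0 <= image_ratio <= 1.0:
        raise ValueError("image_ratio must lie in [0, 1]")
    switch = _switch_point(total_tokens, text_only_fraction)
    # Text-only before the switch point, fixed image/text mixture afterwards.
    return 0.0 if tokens_seen < switch else image_ratio


def phase_token_budgets(
    total_tokens: int,
    text_only_fraction: float = 0.8,
    image_ratio: float = 0.1,
) -> dict:
    """Split a token budget into text-only, mixed-phase text, and mixed-phase image tokens."""
    switch = _switch_point(total_tokens, text_only_fraction)
    mixed = total_tokens - switch
    image_tokens = round(image_ratio * mixed)
    return {
        "text_only": switch,
        "mixed_text": mixed - image_tokens,
        "mixed_image": image_tokens,
    }


if __name__ == "__main__":
    # Toy budget of 1000 tokens, purely for illustration.
    print(phase_token_budgets(1000))
    print(image_token_fraction(500, 1000), image_token_fraction(900, 1000))
```

Setting `text_only_fraction=1.0` in this sketch corresponds to the conventional two-step pipeline, where images are only introduced to a fully pre-trained model, and `text_only_fraction=0.0` corresponds to training with images from scratch.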

Paper Structure

This paper contains 42 sections, 14 figures, 9 tables.

Figures (14)

  • Figure 1: An overview of our VLM pre-training data recipe. We investigate data mixes and design choices for text-only pre-training, image-text pre-training, and fine-tuning. Note that while we depict "LLM Pre-training" and "Image-text Pre-training" as two separate steps in this diagram, in practice, we continuously transition from the first stage to the second.
  • Figure 2: The commonly used framework we apply to add vision capabilities to a transformer model.
  • Figure 3: Representation of the different learning rate schedules used for our experiments. 'Main schedule' corresponds to the learning rate for the initial, text-only pre-training. Other colored schedules are the ones used for image-text training and extend over 28B tokens each. They have been upscaled and appear as extending over 280B tokens for readability.
  • Figure 4: Varying the length of text-only pre-training. We analyze the impact of adding image data after varying amounts of text-only pre-training, showing results on vision benchmarks (green) and text benchmarks (blue). On the left, we show results across a suite of vision and text benchmarks; on the right, we plot two common benchmarks, VQA-v2 and ARC-easy. Introducing images at around 80% of the way through training performs best, maintaining high vision and text task performance. Note: The points at 100% are marked with hollow circles to highlight that they are trained with a different learning rate schedule, as shown in Figure 3.
  • Figure 5: Varying the ratio of image to text data, after some text-only pre-training. We analyze the impact of the ratio of image to text data in pre-training, after the model has seen text-only data for most of pre-training (80%). Unlike when training from scratch (Figure \ref{fig:from_scratch_image_ratio}), we find that adding vision data significantly helps vision performance, while maintaining high text accuracy.
  • ...and 9 more figures