Should VLMs be Pre-trained with Image Data?
Sedrick Keh, Jean Mercat, Samir Yitzhak Gadre, Kushal Arora, Igor Vasiljevic, Benjamin Burchfiel, Shuran Song, Russ Tedrake, Thomas Kollar, Ludwig Schmidt, Achal Dave
TL;DR
This work challenges the conventional two-stage paradigm for vision-language model (VLM) training, in which images are added only after text pre-training is complete, by integrating image data into the pre-training phase itself. Through a large-scale study of 300 models across scales, data compositions, and stages of pre-training progress, it shows that introducing visual tokens during a cooldown phase, roughly 80% of the way through text pre-training, improves results on vision-language benchmarks while preserving text performance, with the best visual-token fraction around 10–20% at 1B parameters. It also finds that pre-training with image data from scratch can harm both vision and text tasks, and that including instruction-tuning data in pre-training degrades vision-language performance but can help text tasks. Fine-tuning needs only a modest number of epochs (2–4) to balance vision and text outcomes. These results provide practical guidance for designing efficient, open-source VLM training pipelines and suggest that carefully timed, partially integrated pre-training can outperform the traditional fully sequential approach.
Abstract
Pre-trained LLMs that are further trained with image data perform well on vision-language tasks. While adding images during a second training phase effectively unlocks this capability, it is unclear how much of a gain or loss this two-step pipeline gives over VLMs that integrate images earlier into the training process. To investigate this, we train models spanning various datasets, scales, image-text ratios, and amounts of pre-training done before introducing vision tokens. We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. We find that pre-training with a mixture of image and text data allows models to perform better on vision-language tasks while maintaining strong performance on text-only evaluations. Averaged over 6 diverse tasks, we find that for a 1B model, introducing visual tokens 80% of the way through pre-training results in a 2% average improvement over introducing visual tokens to a fully pre-trained model.
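
To make the schedule described above concrete, here is a minimal sketch of a data-mixing rule that keeps training text-only until a chosen point and then mixes in image-text data at a fixed token fraction. The function names, the step-function shape of the schedule, and the defaults (`start_frac=0.8`, `image_ratio=0.15`) are illustrative assumptions based on the numbers quoted above; this is not the authors' training code.

```python
def image_token_fraction(step, total_steps, start_frac=0.8, image_ratio=0.15):
    """Fraction of a batch's tokens drawn from image-text data at `step`.

    Hypothetical helper: assumes a hard switch from text-only training to a
    fixed image/text mixture once `start_frac` of pre-training has elapsed.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # Text-only until `start_frac` of the pre-training steps have elapsed.
    if step < start_frac * total_steps:
        return 0.0
    # Afterwards, a fixed share of each batch comes from image-text data.
    return image_ratio


def split_batch(batch_tokens, step, total_steps, **schedule_kwargs):
    """Return (text_tokens, image_tokens) for one batch at `step`."""
    frac = image_token_fraction(step, total_steps, **schedule_kwargs)
    image_tokens = round(batch_tokens * frac)
    return batch_tokens - image_tokens, image_tokens


if __name__ == "__main__":
    total = 1000
    for step in (0, 799, 800, 999):
        print(step, split_batch(1000, step, total))
```

Setting `start_frac=1.0` in this sketch keeps every pre-training batch text-only, which corresponds to the conventional pipeline where images appear only in a later phase, while `start_frac=0.0` corresponds to training with images from scratch.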
