Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
Weizhi Wang, Yu Tian, Linjie Yang, Heng Wang, Xifeng Yan
TL;DR
Open-Qwen2VL demonstrates that compute-efficient pre-training of a fully open 2B multimodal LLM is achievable on modest academic hardware by combining high-quality data filtering (the MLLM-based SU metric), a low-to-high image-resolution design, and multimodal sequence packing. With a curated 29M image-text dataset assembled from CCS-CLIP, DataComp-DFN, LAION-CLIP, and MLM-Filter, the model achieves strong performance after 220 GPU hours of pre-training and 48 GPU hours of SFT, outperforming partially-open peers on multiple benchmarks while using only 0.36% of the pre-training tokens of Qwen2-VL. The work provides an end-to-end, open-source pipeline that covers data curation, the pre-training codebase, sequence-packing scripts, data in WebDataset format, and both base and instruction-tuned checkpoints, with the aim of democratizing access to competitive multimodal capabilities. Overall, the results suggest that carefully designed data filtering, packing, and training infrastructure can narrow the gap between large-scale SOTA models and research groups with limited resources, accelerating reproducible progress in multimodal AI.
Abstract
The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model (MLLM) pre-trained efficiently on 29M image-text pairs using only 220 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. Open-Qwen2VL is pre-trained on academic-level 8×A100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of the 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms the partially-open state-of-the-art MLLM Qwen2-VL-2B on several multimodal benchmarks, namely MMBench, SEEDBench, MMStar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, the FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.
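To make the idea of multimodal sequence packing concrete, the following is a minimal sketch, not the released packing script. It assumes each image-text sample has already been reduced to a total token count (visual plus text tokens) and groups samples into fixed-length training sequences with a first-fit-decreasing heuristic. The function name `pack_first_fit_decreasing`, the 4096-token context length, and the sample lengths are all hypothetical choices made for illustration.

```python
from typing import List


def pack_first_fit_decreasing(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into bins whose total token count is <= max_len.

    Hypothetical helper: samples are visited longest first, and each one is
    placed into the first existing bin that still has room for it.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []   # sample indices per packed sequence
    remaining: List[int] = []    # free token budget per packed sequence

    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, more than max_len={max_len}")
        for b, free in enumerate(remaining):
            if n <= free:
                bins[b].append(i)
                remaining[b] -= n
                break
        else:
            # No existing bin can hold this sample, so open a new one.
            bins.append([i])
            remaining.append(max_len - n)
    return bins


# Illustrative token counts for six image-text samples (made-up numbers).
lengths = [300, 1200, 700, 4000, 90, 2500]
packed = pack_first_fit_decreasing(lengths, max_len=4096)
print(packed)  # [[3, 4], [5, 1, 0], [2]]
```

In this toy case six samples fit into three packed sequences rather than six padded ones, which is where the efficiency gain comes from: fewer padding tokens are processed per training step. A real pipeline would also need to track per-sample boundaries so that attention and loss do not leak across packed samples.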
