Scalable Vision Language Model Training via High Quality Data Curation

Hongyuan Dong; Zijian Kang; Weijie Yin; Xiao Liang; Chao Feng; Jiao Ran

TL;DR

This work introduces SAIL-VL, an open-source vision-language model series that achieves state-of-the-art results at 2B and 8B scales by combining a scalable, high-quality data curation pipeline (SAIL-Caption), large-scale pretraining up to 655B tokens, and a curriculum-based multi-stage visual instruction tuning strategy. The authors demonstrate data-size scaling laws for pretraining and show near-linear gains in SFT as data complexity increases, validating the importance of data quality and curriculum design for VLM performance. SAIL-VL-2B and 8B outperform existing baselines on 18 open benchmarks, and the work provides practical guidance for scalable VLM training, data curation, and evaluation in the community.

Abstract

In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning via High QuaLity Data Curation), an open-source vision language model (VLM) series achieving state-of-the-art (SOTA) performance at the 2B and 8B parameter scales. The following three key improvements contribute to SAIL-VL's leading performance: (1) Scalable high-quality visual understanding data construction: We implement a data construction pipeline that enables hundred-million-scale high-quality recaption data annotation. The resulting dataset, SAIL-Caption, is validated to be of the highest data quality compared with open-source datasets. (2) Scalable pretraining with high-quality visual understanding data: We scale SAIL-VL's pretraining budget up to 655B tokens and show that even a 2B VLM benefits from scaled-up training data sizes, exhibiting logarithmic data size scaling laws in benchmark performance. (3) Scalable SFT via data quantity and complexity scaling: We curate a high-quality SFT dataset collection with leading data quantity scaling effectiveness and demonstrate that training with progressively higher-complexity data surpasses baseline one-stage training by a large margin. SAIL-VL series models achieve the highest average score across 18 widely used VLM benchmarks in our evaluation, with the 2B model taking the top position among VLMs of comparable size on OpenCompass 2024 (https://rank.opencompass.org.cn/leaderboard-multimodal), demonstrating robust visual comprehension abilities. SAIL-VL series models are released on HuggingFace (https://huggingface.co/BytedanceDouyinContent).
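To make the "logarithmic data size scaling law" concrete, the sketch below fits a curve of the form score = a + b * ln(tokens) by ordinary least squares. This is only an illustration of what such a law looks like, not the authors' fitting procedure: the function names `fit_log_scaling` and `predict` are hypothetical, and the data points are synthetic placeholders rather than results from the paper.

```python
import math


def fit_log_scaling(token_counts, scores):
    """Least-squares fit of score = a + b * ln(tokens).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(token_counts) != len(scores) or len(token_counts) < 2:
        raise ValueError("need at least two (tokens, score) pairs of equal length")
    xs = [math.log(t) for t in token_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("token counts must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx            # score gained per unit increase in ln(tokens)
    a = mean_y - b * mean_x  # intercept
    return a, b


def predict(a, b, tokens):
    """Evaluate the fitted curve at a given token budget."""
    return a + b * math.log(tokens)


# Synthetic placeholder data (NOT from the paper): the score rises by a
# constant amount for every 10x increase in training tokens.
tokens = [1e9, 1e10, 1e11]
scores = [50.0, 55.0, 60.0]
a, b = fit_log_scaling(tokens, scores)
gain_per_decade = b * math.log(10)  # score gain per 10x more data
```

Under a law of this shape, each tenfold increase in training tokens adds a roughly constant number of benchmark points, which is why returns diminish per token yet continue to accumulate as the budget grows.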

Paper Structure (65 sections, 11 figures, 11 tables)

This paper contains 65 sections, 11 figures, 11 tables.

Introduction
Model Training Pipeline
Pretrain
SFT
Towards Scalable VLM Training
Towards Scalable VLM Training
Scalable High-Quality Visual Understanding Data Construction
Data collection.
Reference data curation.
Captioner model training.
Scalable high-quality data construction.
SAIL-Caption.
Scalable VLM Pretraining with High-Quality Visual Understanding Data
Improving VLM Visual Understanding Performance via Data Size Scaling
Generalizing Visual Understanding Abilities to Instruction Following Tasks
...and 50 more sections

Figures (11)

Figure 1: SAIL-VL's overall data construction and model training pipeline, as well as data size scaling laws observed in our large-scale VLM training experiments.
Figure 2: Scaling curves of SAIL-VL-2B's performance dynamics in the pretrain-alignment (PT-Ali) stage. We show model performance on all understanding benchmarks, caption tasks and OCR tasks, respectively. "BMK Score" stands for average benchmark scores.
Figure 3: Scaling curves of SAIL-VL-2B's performance dynamics in the pretrain-advance (PT-Adv) stage. We show pretrained and SFT model performance on understanding benchmarks and OS (open-source) VLM benchmarks, respectively. "BMK Score" stands for average benchmark scores.
Figure 4: Scaling curves of model performance trained on our SAIL-Instruct dataset, LLaVA-OneVision (Li et al., 2024) single-image SFT data, and datasets from Infinity-MM (Gu et al., 2024). Model performance is shown as an average score across 18 benchmarks.
Figure 5: Model performance dynamics of the quality scaling and all-in-one (AIO) training strategy. "AIO learning" incorporates all three-stage SFT data into a single training loop. Model performance is shown as an average score across 18 benchmarks.
...and 6 more figures
