Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

Ziyuan Huang; Kaixiang Ji; Biao Gong; Zhiwu Qing; Qinglong Zhang; Kecheng Zheng; Jian Wang; Jingdong Chen; Ming Yang

TL;DR

Chain-of-Sight addresses the high computational cost of pre-training multimodal LLMs by using multi-scale visual resamplers to reduce the number of visual tokens processed during pre-training, while a post-pretrain token scaling strategy increases the token count during fine-tuning. The method supports up to a 16$\times$ token increase after pre-training and cuts wall-clock pre-training time by around 73% without sacrificing downstream performance: pre-training with 32 tokens matches or surpasses models trained with 336 tokens throughout. It relies on coarse-to-fine token integration and a parameter-inflation initialization to maintain performance, and achieves competitive results across vision-language benchmarks with a lightweight fine-tuning regime (LoRA). The work points to a practical path toward faster, scalable pre-training of MLLMs and motivates further exploration of multi-scale, token-aware bridging for cross-modal models.

Abstract

This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count after pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase than in the fine-tuning phase. This intentional reduction of visual tokens notably accelerates pre-training, cutting the wall-clock training time by ~73%. Empirical results on a series of vision-language benchmarks show that the pre-training acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline that uses all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performance, competitive with existing approaches on a series of benchmarks.
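To make the headline numbers concrete, the short sketch below converts the reported ~73% wall-clock reduction into a speedup factor and computes the ceiling implied by the "up to 16x" token scaling for a 32-token pre-training budget. The function names are hypothetical, and the 512-token figure is only the arithmetic upper bound of the stated ratio; the summary does not give the exact fine-tuning token count for this configuration.

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a fractional wall-clock reduction (e.g. 0.73) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def max_scaled_tokens(pretrain_tokens: int, max_scale: int) -> int:
    """Upper bound on the token count after post-pretrain scaling."""
    return pretrain_tokens * max_scale


# ~73% less wall-clock time means pre-training runs roughly 3.7x faster.
speedup = speedup_from_time_reduction(0.73)

# Illustrative ceiling only: 32 pre-training tokens scaled by at most 16x.
token_ceiling = max_scaled_tokens(32, 16)

print(f"speedup: {speedup:.2f}x, token ceiling: {token_ceiling}")
```

Because the speedup is the reciprocal of the remaining time fraction, each further percentage point of reduction is worth more than the previous one.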

Paper Structure (15 sections, 1 equation, 5 figures, 10 tables)

Introduction
Method
Re-examining the efficiency bottleneck in MLLM pre-training
Multi-scale visual resamplers
Post-pretrain token scaling strategy
Experiments
Experimental setup
Ablations
Comparison with existing approaches
Related work
Discussions
Details on multi-level feature aggregation
Details on training data and evaluation benchmarks
Detailed training settings
Further results

Figures (5)

Figure 1: Chain-of-Sight concept overview. Current MLLMs maintain a constant set of visual tokens in both pre-training and fine-tuning. These tokens typically represent visual content at a single visual scale. In contrast, our Chain-of-Sight approach leverages the idea of visual hierarchy, producing multi-scale visual tokens. Moreover, the token scaling strategy enabled by our multi-scale visual resamplers allows us to start with a small pool of visual tokens for pre-training, before increasing the number of tokens during fine-tuning. This considerably accelerates the pre-training phase.
Figure 2: The Chain-of-Sight framework. By partitioning visual features into windows and restricting cross-attention to the windowed features associated with the learnable tokens, our Chain-of-Sight approach produces visual tokens that encompass multiple scales. Thanks to the post-pretrain token scaling strategy, Chain-of-Sight reduces the number of visual tokens required in pre-training, thus accelerating the process. In contrast, the number of visual tokens remains constant across pre-training and fine-tuning in resampler-based methods (BLIP-2, Flamingo, Qwen-VL, mPLUG-Owl2), while linear-layer-based methods (LLaVA-1.5, DeepSeek-VL, CogVLM, PaLI-3) produce a large number of visual tokens, incurring a high pre-training cost.
Figure 3: Detailed illustration of our post-pretrain token scaling strategy.
Figure 4: Pre-train acceleration by Chain-of-Sight, in comparison with standard resamplers. The average performance is computed over the reported benchmarks in Table \ref{tab:comparison_cap}. Our method achieves a pre-train acceleration of 73% without compromising performance.
Figure A1: Multi-level feature aggregation in the multi-scale visual resamplers of Chain-of-Sight.
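
The windowed cross-attention described in the Figure 2 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the query, key, and value projections are omitted, the learnable tokens are assumed to be shared across the windows of a given scale, and the grid size, window sizes, and per-window token counts are hypothetical values chosen so that the total comes to 32 tokens.

```python
import numpy as np


def windowed_cross_attention(features, queries, window):
    """Attend from learnable queries to each square window of a feature grid.

    features: (H, W, C) visual feature grid.
    queries:  (k, C) learnable tokens, assumed shared across windows.
    window:   side length of each window; must divide H and W.
    Returns an array of shape (num_windows * k, C).
    """
    H, W, C = features.shape
    assert H % window == 0 and W % window == 0
    tokens = []
    for i in range(0, H, window):
        for j in range(0, W, window):
            # Keys and values come only from the current window.
            kv = features[i:i + window, j:j + window].reshape(-1, C)
            scores = queries @ kv.T / np.sqrt(C)
            scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
            weights = np.exp(scores)
            weights /= weights.sum(axis=1, keepdims=True)
            tokens.append(weights @ kv)
    return np.concatenate(tokens, axis=0)


def multi_scale_tokens(features, queries_per_scale, windows):
    """Concatenate the tokens produced at each scale, coarse to fine."""
    return np.concatenate(
        [windowed_cross_attention(features, q, w)
         for q, w in zip(queries_per_scale, windows)],
        axis=0,
    )


rng = np.random.default_rng(0)
features = rng.standard_normal((16, 16, 8))   # hypothetical 16x16 grid, 8 channels
windows = [16, 8]                             # 1 global window, then 4 local windows
queries_per_scale = [
    rng.standard_normal((16, 8)),             # 16 tokens for the global window
    rng.standard_normal((4, 8)),              # 4 tokens for each local window
]
tokens = multi_scale_tokens(features, queries_per_scale, windows)
print(tokens.shape)  # 16 * 1 + 4 * 4 = 32 tokens
```

In this simplified picture, the token count can be raised after pre-training by using smaller windows or more queries per window, which is the kind of knob the post-pretrain token scaling strategy turns.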
