SOLO: A Single Transformer for Scalable Vision-Language Modeling

Yangyi Chen, Xingyao Wang, Hao Peng, Heng Ji

TL;DR

This work argues that a unified Transformer architecture can overcome scalability limits faced by multi-component LVLMs by removing the dependence on pre-trained visual encoders. The authors introduce SOLO, a 7B LVLM initialized from Mistral-7B and trained via a three-stage pre-training curriculum on ImageNet, web-scale data, and an annealing regime, followed by instruction fine-tuning with curated datasets. SOLO achieves performance competitive with LLaVA-v1.5-7B and excels in visual mathematical reasoning, while offering superior training/inference speed and easier scaling-law analysis, especially under high-resolution inputs. The paper provides a complete, open-source training recipe suitable for modest academic resources, highlighting the practical viability and scalability benefits of unified vision-language modeling while candidly noting current limitations and areas for further improvement.

Abstract

We present SOLO, a single transformer for Scalable visiOn-Language mOdeling. Current large vision-language models (LVLMs) such as LLaVA mostly employ heterogeneous architectures that connect pre-trained visual encoders with large language models (LLMs) to facilitate visual recognition and complex reasoning. Although these models achieve remarkable performance with relatively lightweight training, we identify four primary scalability limitations: (1) The visual capacity is constrained by pre-trained visual encoders, which are typically an order of magnitude smaller than LLMs. (2) The heterogeneous architecture complicates the use of established hardware and software infrastructure. (3) The study of scaling laws on such architectures must consider three separate components (visual encoder, connector, and LLM), which complicates the analysis. (4) The use of existing visual encoders typically requires following a pre-defined specification for image input pre-processing, for example, reshaping inputs to fixed-resolution square images, which makes it difficult to process and train on high-resolution images or those with unusual aspect ratios. A unified single Transformer architecture, like SOLO, effectively addresses these scalability concerns in LVLMs; however, its limited adoption in the modern context likely stems from the absence of reliable training recipes that balance both modalities and ensure stable training for billion-scale models. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM, using moderate academic resources. The training recipe involves initializing from LLMs, sequential pre-training on ImageNet and web-scale data, and instruction fine-tuning on our curated high-quality datasets. On extensive evaluation, SOLO demonstrates performance comparable to LLaVA-v1.5-7B, particularly excelling in visual mathematical reasoning.

Paper Structure (56 sections, 1 equation, 13 figures, 7 tables)

This paper contains 56 sections, 1 equation, 13 figures, 7 tables.

Figures (13)

  • Figure 1: (Previous work) The mainstream approaches for vision-language modeling rely on pre-trained visual encoders for visual feature extraction, which exhibits scalability limitations. (Our work) We advocate for a unified transformer architecture that processes both images and text, employing a simple linear projection to directly handle raw image pixels. <vision>, </vision>, and <vrow_sep> are special tokens designed explicitly for visual modality encoding.
  • Figure 2: The input image resize algorithm to maintain the aspect ratio.
  • Figure 3: Image captioning loss using two differently initialized checkpoints: (1) caption-only pre-training (green) initialized from the LLM; (2) two-stage pre-training (blue) initialized from the Stage-1 ImageNet pre-trained LVLM.
  • Figure 4: Qualitative analysis of caption-only pre-training and SOLO's two-stage pre-training. Comparisons are made on two checkpoints with comparable vision-language modeling loss (i.e., 2.1). Specifically, we select the caption-only checkpoint at pre-training step 150, and SOLO at step 100.
  • Figure 5: The evaluation performance of various ablations to validate key ingredients of our recipe. The MME scores are normalized for better illustration.
  • ...and 8 more figures
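
The captions of Figures 1 and 2 describe an aspect-ratio-preserving resize followed by a flat sequence of image patches wrapped in the special tokens <vision>, </vision>, and <vrow_sep>. The sketch below is one plausible reading of that pipeline, not the authors' implementation. The patch size of 32 pixels, the 1024-pixel cap on the longer side, the placement of <vrow_sep> at the end of each patch row, and the "<patch>" placeholder string are all assumptions made for illustration; none of them is stated in this summary.

```python
# Illustrative sketch only. PATCH, MAX_SIDE, the row-separator placement,
# and the "<patch>" placeholder are assumptions, not values from the paper summary.

PATCH = 32       # assumed patch side length in pixels
MAX_SIDE = 1024  # assumed cap on the longer image side


def resize_keep_aspect(width, height, max_side=MAX_SIDE, patch=PATCH):
    """Scale so the longer side is at most max_side, then round each side
    to the nearest multiple of patch (at least one patch per side).

    The rounding step can shift the aspect ratio slightly.
    """
    scale = min(1.0, max_side / max(width, height))
    new_w = max(patch, round(width * scale / patch) * patch)
    new_h = max(patch, round(height * scale / patch) * patch)
    return new_w, new_h


def vision_token_layout(width, height, patch=PATCH):
    """Return the symbolic token sequence for an already-resized image.

    "<patch>" is a hypothetical stand-in for one linearly projected patch
    of raw pixels; in a real model it would be an embedding, not a string.
    """
    cols, rows = width // patch, height // patch
    tokens = ["<vision>"]
    for _ in range(rows):
        tokens.extend(["<patch>"] * cols)
        tokens.append("<vrow_sep>")  # assumed: one separator closes each row
    tokens.append("</vision>")
    return tokens


if __name__ == "__main__":
    w, h = resize_keep_aspect(2048, 1024)
    print(w, h, len(vision_token_layout(w, h)))
```

Under these assumptions the sequence length grows with image area rather than being fixed by an encoder's input resolution, which is the property the abstract's fourth limitation is concerned with.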