SOLO: A Single Transformer for Scalable Vision-Language Modeling
Yangyi Chen, Xingyao Wang, Hao Peng, Heng Ji
TL;DR
This work argues that a unified Transformer architecture can overcome the scalability limits of multi-component LVLMs by removing the dependence on pre-trained visual encoders. The authors introduce SOLO, a 7B LVLM initialized from Mistral-7B and trained with a three-stage pre-training curriculum (ImageNet, then web-scale data, then an annealing stage), followed by instruction fine-tuning on curated datasets. SOLO achieves performance competitive with LLaVA-v1.5-7B and excels in visual mathematical reasoning, while offering faster training and inference and simpler scaling-law analysis, especially with high-resolution inputs. The paper provides a complete, open-source training recipe that fits within modest academic resources, demonstrating the practical viability and scalability benefits of unified vision-language modeling while candidly noting current limitations and areas for improvement.
Abstract
We present SOLO, a single transformer for Scalable visiOn-Language mOdeling. Current large vision-language models (LVLMs) such as LLaVA mostly employ heterogeneous architectures that connect pre-trained visual encoders with large language models (LLMs) to facilitate visual recognition and complex reasoning. Although these models achieve remarkable performance with relatively lightweight training, we identify four primary scalability limitations: (1) The visual capacity is constrained by the pre-trained visual encoders, which are typically an order of magnitude smaller than LLMs. (2) The heterogeneous architecture complicates the use of established hardware and software infrastructure. (3) The study of scaling laws on such an architecture must consider three separate components (visual encoder, connector, and LLM), which complicates the analysis. (4) The use of existing visual encoders typically requires following a pre-defined specification for image input pre-processing, for example reshaping inputs to fixed-resolution square images, which makes it difficult to process and train on high-resolution images or those with unusual aspect ratios. A unified single-Transformer architecture like SOLO effectively addresses these scalability concerns in LVLMs; however, its limited adoption in the modern context likely stems from the absence of reliable training recipes that balance both modalities and ensure stable training for billion-scale models. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM, using moderate academic resources. The training recipe involves initializing from LLMs, sequential pre-training on ImageNet and web-scale data, and instruction fine-tuning on our curated high-quality datasets. In extensive evaluation, SOLO demonstrates performance comparable to LLaVA-v1.5-7B, particularly excelling in visual mathematical reasoning.
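Limitation (4) can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a model that splits raw pixels into fixed-size patches and feeds one token per patch to the Transformer. The 32-pixel patch size and the helper names (`patch_grid`, `visual_token_count`, `square_resize_distortion`) are hypothetical choices for this example, not details taken from the abstract.

```python
import math


def patch_grid(height, width, patch_size=32):
    """Return (rows, cols) of patches needed to cover an image.

    Partial patches at the border are counted as full patches,
    i.e. the image is assumed to be padded up to a multiple of patch_size.
    """
    if height <= 0 or width <= 0 or patch_size <= 0:
        raise ValueError("height, width, and patch_size must be positive")
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    return rows, cols


def visual_token_count(height, width, patch_size=32):
    """Number of patch tokens if the native resolution is kept."""
    rows, cols = patch_grid(height, width, patch_size)
    return rows * cols


def square_resize_distortion(height, width):
    """Factor by which the aspect ratio changes when an image is
    forced into a square (1.0 means the image was already square)."""
    return max(height, width) / min(height, width)


# A wide 256x1024 image keeps its shape as an 8x32 patch grid,
# whereas forcing it into a square stretches one axis 4x relative to the other.
grid = patch_grid(256, 1024)
tokens = visual_token_count(256, 1024)
distortion = square_resize_distortion(256, 1024)
```

Under these assumptions, the token count grows with the image area while the aspect ratio is preserved. A fixed-resolution square encoder instead yields a constant token count at the cost of distorting or downsampling the input.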
