The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, Zilong Huang

TL;DR

This work probes whether a single-transformer model can achieve multimodal vision-language competence without a pretrained vision encoder. Through a two-stage pretraining curriculum, mixed attention for image patches, and multimodal rotary position encoding, Sail attains competitive results on vision-language benchmarks and strong visual representations, often rivaling modular MLLMs. The study reveals favorable data- and model-scaling properties, a vision-centric information flow, and emergent vision-encoding capabilities within a unified architecture. It also provides ablation insights on attention mechanisms and the value of text-only pretraining signals to preserve language proficiency. Overall, encoder-free single-transformer MLLMs show promise for scalable, simplified multimodal learning with robust vision-backbone potential.

Abstract

This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a single architecture. Unlike existing modular MLLMs, which rely on a pretrained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, yielding a more minimalist architecture. Instead of introducing novel architectural components, SAIL adapts mixed-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties (scalability, cross-modal information flow patterns, and visual representation capabilities) with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.

Paper Structure

This paper contains 19 sections, 7 figures, 14 tables.

Figures (7)

  • Figure 1: (A) Data scaling curve for Modular Multimodal Large Language Model (MLLM) and Sail, our Single Transformer-based MLLM. As pretraining data increases, the single transformer Sail shows a sharper performance gain, demonstrating its superior data scalability. (B) Comparison to existing Single Transformer-based MLLMs: our Sail pushes the performance boundaries on both vision tasks and vision-language tasks.
  • Figure 2: Model architecture and micro-designs for Sail. (A) Model Architecture: Sail is a unified transformer that processes both images and texts without extra module designs. (B) Mixed Attention Mechanism: we adopt bidirectional attention for image patches from the same image and causal attention for text tokens. Examples for a multimodal sequence and a text sequence are provided. Colored squares represent "allow to attend" and white squares indicate "prevent from attending". (C) Multimodal RoPE: an illustration of the multimodal rotary position embedding for Sail, with examples for a multimodal sequence and a text sequence.
  • Figure 3: Model scaling of Sail. Left: As the model size increases, the training language modeling loss gradually decreases. Right: As the model size increases, performance on downstream VLM tasks progressively improves.
  • Figure 4: Image Attention Score Allocation: The figure shows the proportion of image attention scores across different transformer layers for Single Transformer-based MLLM and modular MLLM when predicting tokens. Single Transformer-based MLLM generally allocates higher attention weights to image tokens compared to modular MLLM.
  • Figure 5: Comparison of Sail and LLaVA1.5 on MMVP examples. Sail demonstrates better performance in perceiving minor regions and objects, as well as more accurately distinguishing object states.
  • ...and 2 more figures
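
The mixed attention pattern described in the Figure 2(B) caption can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function name `mixed_attention_mask` and the `image_ids` encoding (None for a text token, an integer image index for a patch token) are hypothetical choices made for this example. It also assumes that every token attends causally to everything before it, and that patches of the same image may additionally attend to each other in both directions; the caption does not spell out how image patches treat preceding text.

```python
from typing import List, Optional


def mixed_attention_mask(image_ids: List[Optional[int]]) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    image_ids[i] is None for a text token, or an integer naming the image
    that patch token i belongs to (a hypothetical encoding for this sketch).
    """
    n = len(image_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            # Causal rule: a token may always look at itself and earlier tokens.
            causal = k <= q
            # Bidirectional rule: patches of the same image see each other,
            # including patches that come later in the sequence.
            same_image = image_ids[q] is not None and image_ids[q] == image_ids[k]
            mask[q][k] = causal or same_image
    return mask


# Multimodal sequence: two text tokens, three patches of image 0, two text tokens.
multimodal = [None, None, 0, 0, 0, None, None]
for row in mixed_attention_mask(multimodal):
    # "x" marks "allow to attend", "." marks "prevent from attending".
    print("".join("x" if allowed else "." for allowed in row))
```

For a text-only sequence (all entries None) the mask reduces to the ordinary lower-triangular causal mask, which matches the caption's distinction between a multimodal sequence and a text sequence.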