Are Bigger Encoders Always Better in Vision Large Models?

Bozhou Li, Hao Liang, Zimo Meng, Wentao Zhang

TL;DR

This study asks whether simply scaling the vision encoder improves vision-language models (VLMs) built on the connected-vision paradigm. It pairs ViT-CLIP encoders with Vicuna backbones (7B and 13B) and trains on image-text pairs sampled from CC12M and LAION-400M. In the multimodal pretraining (MM PT) stage, the backbones are frozen and a two-layer MLP projector is trained. A systematic scaling-law analysis covers data sizes from 1M to 10M image-text pairs. The key finding is that a larger vision encoder does not guarantee better performance; data quality and LLM backbone size both have a substantial influence, and larger LLMs can be more data-efficient. The results motivate data-centric alignment strategies and architectural innovations beyond naive encoder scaling, with implications for resource allocation and future multimodal model design.

Abstract

In recent years, multimodal large language models (MLLMs) have shown strong potential in real-world applications. They are developing rapidly thanks to their remarkable ability to comprehend multimodal information and their powerful cognitive and reasoning capabilities. Among MLLMs, vision-language models (VLMs) stand out for their ability to understand visual information. However, the scaling trend of VLMs under the current mainstream paradigm has not been extensively studied, and it remains unclear whether training even larger models yields better performance. To address this question, we conducted experiments on the pretraining stage of MLLMs using different encoder sizes and large language model (LLM) sizes. Our findings indicate that merely increasing encoder size does not necessarily enhance VLM performance. We also analyzed the effects of LLM backbone parameter size and data quality on pretraining outcomes, and explored how scaling laws differ between LLMs and VLMs.
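
To make the idea of a scaling-law analysis over data size concrete, the sketch below fits a power law to loss measured at several training-set sizes. It is an illustration only: the functional form `loss = a * size**(-b)`, the helper name `fit_power_law`, and the constants `TRUE_A` and `TRUE_B` are assumptions introduced here, and the data points are synthetic, not results from the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b), done in log-log space.

    Hypothetical helper for illustration; the paper's own fitting
    procedure is not described in this section.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Ordinary least squares for a straight line: log(loss) = intercept + slope * log(size)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic, noise-free points generated from an assumed power law.
# The sizes mirror the 1M to 10M range mentioned in the TL;DR.
TRUE_A, TRUE_B = 12.0, 0.1
sizes = [1_000_000 * k for k in range(1, 11)]
losses = [TRUE_A * n ** (-TRUE_B) for n in sizes]

a, b = fit_power_law(sizes, losses)
print(f"fitted a={a:.4f}, b={b:.4f}")
```

Under these assumptions, repeating the fit for each encoder size and comparing the fitted exponents `b` is one way to check whether a larger encoder actually changes how quickly loss falls as data grows.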

Paper Structure (11 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Using Vicuna-7B as LLM backbone, trained on data sampled from CC12M
  • Figure 2: Using Vicuna-13B as LLM backbone, trained on data sampled from CC12M
  • Figure 3: Using Vicuna-7B as LLM backbone, trained on data sampled from LAION-400M