EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang

TL;DR

This work tackles the challenge of achieving strong vision-language understanding with encoder-free models. It introduces EVEv2.0, a decoder-only VLM that fully decomposes its weights by modality and encodes images losslessly through a patch embedding layer rather than a pre-trained vision encoder, guided by modality-specific routers to minimize interference. A four-stage training pipeline uses a high-quality captioning engine (DenseFusion++) and multi-task data to build cross-modal alignment from scratch, achieving results competitive with encoder-based models while using far less data. The study offers practical insights into scaling encoder-free VLMs and a concrete blueprint for building native, scalable multimodal systems.

Abstract

Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, and examine in depth the under-explored characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities; (ii) a well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study for developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability. Code is publicly available at: https://github.com/baaivision/EVE.
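
To illustrate what decomposing vision and language within a unified model can look like, the minimal Python sketch below routes each token to a vision or a text weight matrix according to a modality flag. It is not the authors' implementation: the names `matvec` and `modality_linear`, the toy 2x2 weights, and the restriction to a single linear projection are assumptions made for illustration. According to the paper's framework description, EVEv2.0 applies modality-specific weights to the self-attention, feed-forward, and layer normalization layers.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (cols)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def modality_linear(
    tokens: Sequence[Vector],
    is_vision: Sequence[bool],
    w_vision: Matrix,
    w_text: Matrix,
) -> List[Vector]:
    """Apply w_vision to vision tokens and w_text to text tokens.

    Token order is preserved, so the output can be fed to later layers
    as one interleaved sequence (hypothetical helper, for illustration).
    """
    if len(tokens) != len(is_vision):
        raise ValueError("tokens and is_vision must have the same length")
    return [
        matvec(w_vision if vision else w_text, tok)
        for tok, vision in zip(tokens, is_vision)
    ]


# Toy sequence: two vision tokens followed by one text token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_vision = [True, True, False]
w_vision = [[2.0, 0.0], [0.0, 2.0]]  # toy weights: scale by 2
w_text = [[1.0, 0.0], [0.0, 1.0]]    # toy weights: identity

out = modality_linear(tokens, is_vision, w_vision, w_text)
print(out)  # [[2.0, 4.0], [6.0, 8.0], [5.0, 6.0]]
```

Because each token only ever touches the weights of its own modality, updating the vision weights in this sketch leaves the text path unchanged, which is the intuition behind reducing interference between modalities.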

Paper Structure

This paper contains 17 sections, 3 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Overview of (1) diverse vision construction inside existing VLMs and (2) potential architecture variants of Encoder-Free VLMs.
  • Figure 2: Preliminary scaling efficiency analyses during pre-training or fine-tuning across various VLMs (more details in the Appendix). Notably, VE / DT / EVE apply different image downsampling rates (14$^2$ / 8$^2$ / 32$^2$). For fairness, we choose different resolutions that yield relatively balanced counts of 576 / 1024 / 625 tokens per image. In addition, we quantify weight changes between LLMs and VLMs by averaging the absolute weight variation within a given layer index or layer type. We report GQA, SEED, and TextVQA for in-domain, open-domain, and OCR-related validation, respectively. Note that ScienceQA (SQA) involves text-related knowledge tasks that are susceptible to the LLM's forgetting issue.
  • Figure 3: Overview of our proposed EVEv2.0 framework. We first adopt a patch embedding layer to encode images losslessly, and then concatenate visual and textual tokens into a unified decoder-only vision-language model. The model extends the standard autoregressive transformer by incorporating modality-specific weights for each multi-head self-attention layer, feed-forward layer, and layer normalization.
  • Figure 4: Overview of training procedure. PEL/WEL denotes patch/word embedding layer. We begin by training the patch embedding layer to establish initial alignment across modalities. Afterward, we only update vision layers within the LLM to enhance visual perception progressively. Notably, we gradually increase the image resolutions from 800$\times$800 to 1600$\times$1600 and keep the original image aspect ratio. Finally, we train the entire model via QA and instruction data to strengthen cross-modality correspondence and complex understanding.
  • Figure 5: Training loss curve and evaluation results in Stage 2. We adopt various EVE variants based on Qwen-2.5 as the baseline. We first train the patch embedding layer using EVE-recap-10M in Stage 1, and then unfreeze the vision layers, but not the LLM layers, in Stage 2.
  • ...and 5 more figures
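
The token counts quoted in the Figure 2 caption follow from dividing each image side by the per-side downsampling stride. The short Python sketch below reproduces that arithmetic. The input resolutions (336, 256, and 800 pixels per side) are assumptions chosen because they yield the stated counts; the captions do not give them explicitly. The helper name `tokens_per_image` is likewise hypothetical.

```python
def tokens_per_image(height: int, width: int, stride: int) -> int:
    """Visual tokens when each stride x stride pixel block becomes one token."""
    return (height // stride) * (width // stride)


# (height, width, per-side stride); resolutions are assumed, not from the paper.
settings = {
    "VE": (336, 336, 14),   # downsampling rate 14^2
    "DT": (256, 256, 8),    # downsampling rate 8^2
    "EVE": (800, 800, 32),  # downsampling rate 32^2
}

counts = {name: tokens_per_image(h, w, s) for name, (h, w, s) in settings.items()}
print(counts)  # {'VE': 576, 'DT': 1024, 'EVE': 625}

# Assuming the same 32^2 rate at the largest training resolution in Figure 4:
print(tokens_per_image(1600, 1600, 32))  # 2500
```

Under these assumptions, raising the resolution from 800×800 to 1600×1600 quadruples the number of visual tokens per image, from 625 to 2500.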