Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts

Miao Rang, Zhenni Bi, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang

TL;DR

The paper tackles the challenge of deploying capable vision-language models on edge devices by proposing Eve, a three-stage framework that injects elastic visual experts into a frozen or lightly-tuned LLM to maximize multimodal performance without sacrificing linguistic proficiency. It introduces an Elastic Vision Encoder and an Elastic Vision FFN (EVF) with routing and GBPR/Img-GBPR token allocation, enabling selective, modality-aware processing of tokens. Through a curated Stage-3 dataset and careful training strategies, Eve achieves 68.87% VLM average under 3B parameters and outperforms several larger models in multimodal accuracy while preserving language tasks. The work demonstrates that dynamic visual specialization, combined with balanced loss and efficient data usage, yields strong, practical SVLMs suitable for edge deployment with significantly reduced training overhead.

Abstract

Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices has become a challenge for their widespread application. There are several efficient VLM efforts, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce the framework of Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, among configurations below 3B parameters, Eve leads on language benchmarks and achieves a state-of-the-art result of 68.87% on VLM benchmarks. Additionally, its multimodal accuracy exceeds that of the larger 7B LLaVA-1.5 model. Our code is available at https://github.com/rangmiao/Eve.

Paper Structure (43 sections, 4 equations, 7 figures, 12 tables)

Figures (7)

  • Figure 1: Comparison with SOTA methods at the 1B scale across VLM and language benchmarks.
  • Figure 2: The Eve training framework and strategy. Eve employs a structured three-stage training approach. Stage 1: Only the vision adapter is trained, so that the LLM can process visual inputs. Stage 2: To enhance multimodal capabilities, the vision adapter and the LLM are trained with LoRA. Stage 3: We introduce a new EVF layer, consisting of an elastic vision FFN and a fixed language FFN. The weights from the original FFN are duplicated to initialize the two FFNs in alternating half-layers of the LLM. In this stage only the vision FFN is trained, with the aim of significantly improving the model's comprehension of visual information.
  • Figure 3: The impact of token allocation mechanisms on successful routing in Layers 1, 11, and 21.
  • Figure 4: Example images of the general multi-modal dataset.
  • Figure 5: Display of examples of VQA data.
  • ...and 2 more figures
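
The Stage 3 initialization described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the toy FFN, the helper names (`make_ffn`, `to_evf`, `build_layers`, `layer_forward`, `trainable_names`), and the reading of "alternating half-layers" as "every second layer" are assumptions made for illustration. In particular, the hard dispatch by an `is_image` flag stands in for the paper's routing and GBPR/Img-GBPR token allocation, which this page does not detail.

```python
import copy
import random


def make_ffn(dim, hidden, rng):
    """Hypothetical toy FFN: two weight matrices stored as nested lists."""
    return {
        "w_in": [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(dim)],
        "w_out": [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(hidden)],
    }


def ffn_forward(ffn, x):
    """Linear -> ReLU -> Linear on a single token vector x."""
    hidden = [
        max(0.0, sum(x[i] * ffn["w_in"][i][j] for i in range(len(x))))
        for j in range(len(ffn["w_out"]))
    ]
    return [
        sum(hidden[j] * ffn["w_out"][j][k] for j in range(len(hidden)))
        for k in range(len(x))
    ]


def to_evf(ffn):
    """Duplicate the original FFN weights into a vision FFN and a language FFN."""
    return {"vision": copy.deepcopy(ffn), "language": copy.deepcopy(ffn)}


def build_layers(num_layers, dim, hidden, seed=0):
    rng = random.Random(seed)
    layers = []
    for idx in range(num_layers):
        ffn = make_ffn(dim, hidden, rng)
        # Assumption: "alternating half-layers" means every second layer gets an EVF.
        layers.append(to_evf(ffn) if idx % 2 == 1 else {"shared": ffn})
    return layers


def layer_forward(layer, tokens, is_image):
    """Apply the layer's FFN per token; EVF layers dispatch by modality flag.

    Assumption: a hard modality mask replaces the paper's learned routing.
    """
    outputs = []
    for x, img in zip(tokens, is_image):
        if "shared" in layer:
            ffn = layer["shared"]
        else:
            ffn = layer["vision"] if img else layer["language"]
        outputs.append(ffn_forward(ffn, x))
    return outputs


def trainable_names(layers):
    """Stage 3 trains only the vision FFNs; the language FFNs stay fixed."""
    return [f"layer{idx}.vision" for idx, layer in enumerate(layers) if "vision" in layer]
```

Because both FFNs start as copies of the original, an EVF layer initially produces the same output for a token regardless of which branch handles it; the branches diverge only once the vision FFN is updated, which is one plausible reason this kind of initialization leaves the language path undisturbed.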