Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts
Miao Rang, Zhenni Bi, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang
TL;DR
The paper tackles the challenge of deploying capable vision-language models on edge devices by proposing Eve, a three-stage framework that injects elastic visual experts into a frozen or lightly tuned LLM to improve multimodal performance without sacrificing linguistic proficiency. It introduces an Elastic Vision Encoder and an Elastic Vision FFN (EVF) with routing and GBPR/Img-GBPR token allocation, enabling selective, modality-aware processing of tokens. Using a curated Stage-3 dataset and careful training strategies, Eve achieves a 68.87% average on VLM benchmarks among models under 3B parameters and outperforms several larger models in multimodal accuracy while preserving performance on language tasks. The work demonstrates that dynamic visual specialization, combined with a balanced loss and efficient data usage, yields strong, practical SVLMs suitable for edge deployment with significantly reduced training overhead.
Abstract
Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices has become a challenge for their widespread application. There are several efficient VLM efforts, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, among models with fewer than 3B parameters, Eve leads on language benchmarks and achieves a state-of-the-art 68.87% on VLM benchmarks. Additionally, its multimodal accuracy surpasses that of the larger 7B LLaVA-1.5 model. Our code is available at https://github.com/rangmiao/Eve.
