EVLM: An Efficient Vision-Language Model for Visual Understanding

Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang

TL;DR

EVLM tackles the efficiency and perceptual gaps of existing vision-language models by integrating gated cross-attention, hierarchical ViT features, and a Mixture-of-Experts approach. Trained in a three-stage pipeline on large bilingual image-text data, the model achieves competitive results across 13 multimodal benchmarks and excels at image and video captioning with reduced hallucinations. The work demonstrates a scalable, efficient vision-language fusion framework and introduces design choices (learnable image tokens and frame-aware attention) enabling effective processing of long sequences such as video. Overall, EVLM advances dense captioning capabilities while balancing computational cost through cross-attention-based interactions and MoE scaling.

Abstract

In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, feeding it directly into the language model alongside the textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of the language model can lead to significant computational overhead. Additionally, using single-layer ViT features makes it challenging for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. Our method has three main components: (1) cross-attention for image-text interaction, similar to Flamingo; (2) hierarchical ViT features; and (3) a Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.
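The overhead argument in the abstract can be made concrete with a rough count of query-key pairs per attention layer. The sketch below is an illustration only, not the paper's implementation: the function names, the prompt length, and the number of visual tokens per frame are all hypothetical, and the count ignores causal masking, attention heads, hidden dimensions, and the extra projection cost of any added cross-attention layers.

```python
def full_attention_pairs(num_text: int, num_visual: int) -> int:
    """Query-key pairs when visual tokens are concatenated with text tokens
    and the whole sequence goes through self-attention (LLaVA-style)."""
    n = num_text + num_visual
    return n * n


def cross_attention_pairs(num_text: int, num_visual: int) -> int:
    """Query-key pairs when text tokens self-attend and, separately,
    cross-attend to the visual tokens (Flamingo-style)."""
    return num_text * num_text + num_text * num_visual


if __name__ == "__main__":
    num_text = 256          # hypothetical prompt length
    tokens_per_frame = 576  # hypothetical visual tokens per video frame
    for frames in (1, 8, 32):
        num_visual = frames * tokens_per_frame
        full = full_attention_pairs(num_text, num_visual)
        cross = cross_attention_pairs(num_text, num_visual)
        print(f"frames={frames:>2}  full={full:,}  cross={cross:,}  "
              f"ratio={full / cross:.1f}x")
```

Under these assumptions the full-attention count grows quadratically with the number of visual tokens, while the cross-attention count grows only linearly, which is why the gap widens as more frames are added.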

Paper Structure

This paper contains 24 sections, 2 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Some qualitative examples generated by our model.
  • Figure 2: The framework diagram of our multi-modal model.
  • Figure 3: Full Attention and Cross Attention used in multi-modal model.
  • Figure 4: MoE structure.
  • Figure 5: Visualization of the convergence of the pre-training stage.
  • ...and 6 more figures