
EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Feipeng Ma, Yizhou Zhou, Zheyu Zhang, Shilin Yan, Hebei Li, Zilong He, Siying Wu, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun

TL;DR

EE-MLLM tackles the tension between data efficiency and compute efficiency in multimodal large language models. It introduces a composite attention mechanism that eliminates self-attention among visual tokens, together with a parameter-free aligner that reuses LLM weights for vision-language alignment. This yields a 24% FLOP reduction and cuts prefilling time to 79 ms, compared with 277 ms for LLaVA on an H800 GPU, while maintaining strong performance across general and fine-grained benchmarks. A training-free variant (EE-MLLM-F) shows that the method can be applied to existing self-attention-based models without additional training. The approach achieves competitive results on diverse tasks, including TextVQA and DocVQA, and the lower prefilling latency makes it a candidate for real-time applications such as content moderation and quality control.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated satisfactory performance across various vision-language tasks. Current approaches for vision and language interaction fall into two categories: self-attention-based and cross-attention-based methods. However, both approaches present inherent limitations, forcing a trade-off between data and computational efficiency. To address this issue, we introduce the Data-Efficient and Compute-Efficient MLLM (EE-MLLM). Specifically, we modify the original self-attention mechanism in MLLM to a composite attention mechanism. This mechanism has two key characteristics: 1) eliminating the computational overhead of self-attention among visual tokens to achieve compute efficiency, and 2) reusing the weights from each layer of the LLM to facilitate effective vision-language modality alignment for data efficiency. As a result, EE-MLLM significantly outperforms Flamingo with limited training data, and reduces the prefilling time to 79 ms on an H800 GPU, compared to LLaVA's 277 ms. To further investigate the efficiency of EE-MLLM, we present a training-free variant named EE-MLLM-F, which reduces the computation cost of self-attention-based methods without additional training. Experimental results demonstrate the effectiveness of EE-MLLM across a range of benchmarks, including general-purpose datasets like MMBench and SeedBench, as well as fine-grained tasks such as TextVQA and DocVQA.
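
To give a rough sense of where the compute saving comes from, the sketch below counts query-key pairs in the attention score matrix for the two designs the abstract contrasts. It is a back-of-the-envelope illustration only: the function name `attention_pairs` and the token counts in the usage note are assumptions made here, causal masking is ignored, and the count excludes projection and feed-forward cost, so it does not reproduce the paper's reported FLOP or latency figures.

```python
def attention_pairs(num_visual: int, num_text: int) -> dict:
    """Count query-key pairs scored per attention layer (per head).

    Self-attention-based design: every token (visual and text) is a query
    and attends over every token, giving (V + T) * (V + T) pairs.
    Composite attention as described in the abstract: only text tokens are
    queries, attending over visual + text keys, giving T * (V + T) pairs.
    """
    total = num_visual + num_text
    self_attention = total * total      # all tokens act as queries
    composite = num_text * total        # only text tokens act as queries
    return {
        "self_attention": self_attention,
        "composite": composite,
        # fraction of score-matrix entries that are no longer computed
        "saved_fraction": 1 - composite / self_attention,
    }


if __name__ == "__main__":
    # Hypothetical sizes: 576 visual tokens and 64 text tokens.
    print(attention_pairs(num_visual=576, num_text=64))
```

With these hypothetical sizes the composite design scores 64 × 640 pairs instead of 640 × 640. The saved fraction equals V / (V + T), so the benefit grows as visual tokens dominate the sequence, for example with high-resolution or multi-image inputs.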

Paper Structure (29 sections, 4 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Architecture comparison of the self-attention-based method, the cross-attention-based method, and our EE-MLLM. (1) The self-attention-based mechanism uses a projector to align visual tokens with text tokens, then concatenates these tokens as input for the LLM. (2) The cross-attention-based mechanism integrates additional cross-attention blocks into the decoder layers. (3) EE-MLLM introduces a composite attention mechanism that eliminates the computational overhead of self-attention within visual tokens and reuses the weights of each LLM layer as aligners to facilitate modality alignment.
  • Figure 2: Our composite attention mechanism consists of the composite attention module and the aligner. In the aligner, visual tokens are aligned to the feature space of each LLM layer by the aligner alone. In the composite attention module, the concatenation of visual and text tokens is used as keys and values, and the text tokens are used as queries for attention, thus eliminating self-attention within visual tokens.
  • Figure 3: Comparison of prefilling time between EE-MLLM and LLaVA. The X-axis represents the number of input images, where each image has a resolution of $980 \times 980$. The Y-axis indicates the prefilling time.
  • Figure 4: The visualization results of examples sampled from BLINK and RealWorldQA. Answers are generated by our EE-MLLM.
  • Figure 5: Comparison of EE-MLLM and LLaVA-v1.6 response results on challenging samples.
  • ...and 1 more figure
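
The query/key/value arrangement described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal single-head illustration under stated assumptions, not the authors' implementation: the function name `composite_attention` is made up here, the query/key/value projections are taken to be the identity, the aligner is omitted (visual tokens are assumed to be already aligned), and a causal mask among text tokens is assumed, as is usual for decoder-only LLMs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def composite_attention(visual, text):
    """Single-head attention where only text tokens act as queries.

    visual: list of V vectors (assumed already aligned to the layer's space)
    text:   list of T vectors, all of the same dimension d as `visual`
    Returns T output vectors; visual tokens never act as queries, so no
    visual-to-visual attention scores are computed.
    """
    dim = len(text[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for i, query in enumerate(text):
        # Keys/values: all visual tokens plus text tokens up to position i
        # (the causal mask among text tokens is an assumption of this sketch).
        visible = visual + text[: i + 1]
        weights = softmax([dot(query, key) * scale for key in visible])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
    text_tokens = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
    for row in composite_attention(visual_tokens, text_tokens):
        print(row)
```

The output has one row per text token, whereas a self-attention-based model would also compute an updated row for every visual token at every layer. In EE-MLLM the visual tokens are instead updated by the aligner, which the abstract describes as reusing the weights of each LLM layer.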