Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
Qiong Wu, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
TL;DR
The paper tackles the high computational cost of multi-modal large language models by showing that many multi-head attention (MHA) components are redundant for downstream tasks. It introduces Efficient Attention Skipping (EAS), which combines a reinforcement-learning-driven redundancy evaluation with a Propagation-of-Information Adapter (PIA) that can be re-parameterized into FFNs for zero added latency. Empirical results on LaVIN and METER show that EAS preserves performance while achieving significant inference speedups (e.g., up to 2.18×) and substantially reducing the number of updated parameters. The approach offers a practical path to parameter- and computation-efficient transfer learning for multi-modal LLMs.
Abstract
In this paper, we propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main computational overhead of MLLMs, are often redundant for downstream tasks. Based on this observation, EAS evaluates attention redundancy and skips the less important MHAs to speed up inference. We also propose a novel propagation-of-information adapter (PIA) to support the attention skipping of EAS while preserving parameter efficiency; PIA can be further re-parameterized into feed-forward networks (FFNs) for zero extra latency. To validate EAS, we apply it to a recently proposed MLLM called LaVIN and a classic VL pre-trained model called METER, and conduct extensive experiments on a set of benchmarks. The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference. For instance, LaVIN-EAS obtains 89.98% accuracy on ScienceQA while running inference 2.2 times faster than LaVIN.
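The idea of re-parameterizing an adapter into an FFN can be illustrated with a small numerical sketch. The abstract does not specify PIA's internal design, so everything below is an assumption made for illustration: a purely linear bottleneck adapter with a residual path, placed directly before the first linear layer of an FFN, with no normalization or nonlinearity in between. The names `down`, `up`, `w1`, `adapter_then_linear`, and `merge_adapter` are hypothetical. Under these assumptions, the adapter folds into the FFN weight because W1(x + UDx) = (W1 + W1UD)x.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def adapter_then_linear(x, down, up, w1):
    # Assumed (hypothetical) linear adapter with a residual path:
    # h = x + up @ (down @ x), followed by the FFN's first linear layer.
    delta = matvec(up, matvec(down, x))
    h = [xi + di for xi, di in zip(x, delta)]
    return matvec(w1, h)

def merge_adapter(down, up, w1):
    # Fold the adapter into the FFN weight:
    # w1 @ (I + up @ down) = w1 + w1 @ up @ down
    extra = matmul(w1, matmul(up, down))
    return [[w + e for w, e in zip(w_row, e_row)]
            for w_row, e_row in zip(w1, extra)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Illustrative sizes only; not taken from the paper.
rng = random.Random(0)
d_model, d_bottleneck, d_hidden = 8, 2, 16
down = rand_matrix(d_bottleneck, d_model, rng)
up = rand_matrix(d_model, d_bottleneck, rng)
w1 = rand_matrix(d_hidden, d_model, rng)
x = [rng.uniform(-1, 1) for _ in range(d_model)]

# Two-step path (adapter, then linear) versus a single merged linear layer.
y_two_step = adapter_then_linear(x, down, up, w1)
w1_merged = merge_adapter(down, up, w1)
y_merged = matvec(w1_merged, x)

max_gap = max(abs(a - b) for a, b in zip(y_two_step, y_merged))
print(f"max |difference| between the two paths: {max_gap:.2e}")
```

After merging, inference needs only the single matrix `w1_merged`, which is why a re-parameterizable adapter adds no extra latency in this simplified setting. Whether the same folding holds for the actual PIA depends on design details not given in the abstract.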
