
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models

Qiong Wu, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji

TL;DR

The paper tackles the high computational cost of multi-modal large language models by showing that many multi-head attention (MHA) components are redundant for downstream tasks. It introduces Efficient Attention Skipping (EAS), combining a reinforcement-learning-driven redundancy evaluation with a Propagation-of-Information Adapter (PIA) that can be re-parameterized into FFNs for zero-added latency. Empirical results on LaVIN and METER demonstrate that EAS preserves performance while achieving significant speedups (e.g., up to 2.18×) and substantial reductions in updated parameters. This approach offers a practical path to parameter- and computation-efficient transfer learning for multi-modal LLMs, with broad applicability across VL benchmarks and models.

Abstract

In this paper, we propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main computational overhead of MLLMs, are often redundant to downstream tasks. Based on this observation, EAS evaluates the attention redundancy and skips the less important MHAs to speed up inference. Besides, we also propose a novel propagation-of-information adapter (PIA) to serve the attention skipping of EAS and keep parameter efficiency, which can be further re-parameterized into feed-forward networks (FFNs) for zero extra latency. To validate EAS, we apply it to a recently proposed MLLM called LaVIN and a classic VL pre-trained model called METER, and conduct extensive experiments on a set of benchmarks. The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference. For instance, LaVIN-EAS can obtain 89.98% accuracy on ScienceQA while speeding up inference by 2.2 times compared to LaVIN.
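The re-parameterization idea in the abstract can be illustrated with a small numerical sketch. It assumes a purely linear, low-rank residual adapter placed directly in front of the first FFN projection, with no normalization or nonlinearity in between; this is a simplification chosen for illustration, not the paper's exact PIA design. All names (`merge_linear_adapter`, `w_down`, `w_up`, the toy dimensions) are hypothetical.

```python
import numpy as np

def merge_linear_adapter(w_adapter, b_adapter, w_ffn, b_ffn):
    """Fold (x + x @ w_adapter + b_adapter) @ w_ffn + b_ffn into one linear map.

    Assumption: the adapter is linear and residual, so it can be absorbed
    into the weights of the linear layer that follows it.
    """
    d = w_adapter.shape[0]
    w_merged = (np.eye(d) + w_adapter) @ w_ffn
    b_merged = b_adapter @ w_ffn + b_ffn
    return w_merged, b_merged

rng = np.random.default_rng(0)
d, h, r = 8, 32, 2  # toy sizes: hidden dim, FFN dim, adapter rank

# Hypothetical low-rank adapter: down-projection followed by up-projection.
w_down = rng.normal(size=(d, r))
w_up = rng.normal(size=(r, d))
w_adapter = w_down @ w_up
b_adapter = rng.normal(size=d)

# First linear projection of a toy FFN.
w_ffn = rng.normal(size=(d, h))
b_ffn = rng.normal(size=h)

x = rng.normal(size=(4, d))

# Training-time view: adapter, then FFN projection (two steps).
two_step = (x + x @ w_adapter + b_adapter) @ w_ffn + b_ffn

# Inference-time view: one merged projection, so no extra layer to run.
w_merged, b_merged = merge_linear_adapter(w_adapter, b_adapter, w_ffn, b_ffn)
one_step = x @ w_merged + b_merged

print(np.allclose(two_step, one_step))  # True
```

Because the merged weights have the same shape as the original FFN projection, the adapter adds no layers or parameters at inference time under these assumptions, which is the sense in which such a merge gives zero extra latency.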

Paper Structure (22 sections, 16 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: (a) Running time of LaVIN luo2023cheap, DAS wu2023parameter and our EAS. (b) Performance and speed comparisons of skipping different numbers of MHA and FFN by our EAS on ScienceQA.
  • Figure 2: Illustrations of LaVIN, DAS and our EAS. (a) LaVIN inserts lightweight adapters before MHAs for multi-modal adaption. (b) DAS skips redundant Transformer layers of LaVIN, but still incurs extra latency. (c) EAS resorts to skipping MHAs to achieve true model acceleration with the proposed PIA.
  • Figure 3: Illustrations of the main components of the proposed Efficient Attention Skipping (EAS). (a) The architecture of the propagation-of-information adapter (PIA). PIA uses a multi-path design for up- and down-sampling, which helps it perform information exchange, like MHA, and modality routing luo2023cheap for MLLMs. (b) The deployment of PIA. PIA replaces the skipped MHA as a parameter-efficient method for task adaption. After training, its parameters can be re-parameterized into the FFN, incurring no extra latency. (c) The process of attention redundancy evaluation. Similar to DAS wu2023parameter, EAS also adopts a $k$-armed bandit based algorithm for the automatic redundancy evaluation of the MHAs of MLLMs. After evaluation, we replace the redundant MHAs with PIAs.
  • Figure 4: The predictions of EAS-B$_{12}$ and LaVIN-7B on ScienceQA. The accurate explanations for the answer are highlighted in green, while the logically incorrect ones are in red.
  • Figure 5: Comparison between EAS and DAS on ScienceQA.
  • ...and 2 more figures
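
The Figure 3 caption mentions a $k$-armed bandit based redundancy evaluation. A minimal sketch of that general idea follows, treating each MHA as one arm and keeping a running average reward per arm. The source does not specify the reward, the sampling scheme or the update rule, so everything here is an assumption for illustration: the reward is taken to be the negative loss observed when one MHA is skipped, arms are sampled uniformly, and `evaluate_redundancy`, `toy_loss` and `base_loss` are hypothetical names with made-up toy values.

```python
import numpy as np

def evaluate_redundancy(loss_if_skipped, n_arms, n_skip, steps=300, seed=0):
    """Estimate which n_skip arms (MHAs) are most redundant.

    Assumption: skipping a redundant MHA hurts the loss least, so the
    reward for an arm is the negative loss measured with that MHA skipped.
    """
    rng = np.random.default_rng(seed)
    value = np.zeros(n_arms)             # running mean reward per arm
    pulls = np.zeros(n_arms, dtype=int)  # how often each arm was tried
    for t in range(steps):
        # Try every arm once first, then sample arms uniformly at random.
        arm = t if t < n_arms else int(rng.integers(n_arms))
        reward = -loss_if_skipped(arm)
        pulls[arm] += 1
        value[arm] += (reward - value[arm]) / pulls[arm]  # incremental mean
    # Highest average reward = least damage when skipped = most redundant.
    return sorted(np.argsort(-value)[:n_skip].tolist())

# Toy stand-in for "loss on a batch when MHA `arm` is skipped".
noise = np.random.default_rng(1)
base_loss = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]  # made-up values

def toy_loss(arm):
    return base_loss[arm] + noise.normal(scale=0.05)

skipped = evaluate_redundancy(toy_loss, n_arms=6, n_skip=3)
print(skipped)  # [1, 3, 5] for this toy setup
```

In a real setting, `loss_if_skipped` would run the model on a training batch with the chosen MHA bypassed, which is the expensive part; the bandit bookkeeping itself is cheap.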