Skipping Computations in Multimodal LLMs

Mustafa Shukor, Matthieu Cord

TL;DR

MLLMs contain redundant computations, so inference costs can be reduced significantly without sacrificing performance.

Abstract

Large Language Models (LLMs) have demonstrated remarkable success in both textual and multimodal domains. However, this success often comes with substantial computational costs, particularly when handling lengthy sequences of multimodal inputs. This has sparked many efforts focused on improving efficiency during training and inference. In this study, we investigate computation redundancy in Multimodal Large Language Models (MLLMs) during inference. We propose different methods to skip computations, such as skipping entire blocks, FFN layers, or self-attention (SA) layers. Additionally, we explore parallelizing certain layers, such as FFN and SA layers. Our findings validate that (1) a significant amount of computation can be avoided at inference time, especially for tasks such as Visual Question Answering (VQA); (2) skipping computations during training can recover 97% of the original performance, even when skipping half of the blocks or removing 70% of the weights; and, alternatively, (3) properly training with smaller LLMs can yield performance comparable to LLMs 2 or 3 times larger. To conclude, we extend our investigation to recent MLLMs, such as LLaVA-1.5, and make similar observations. Our work shows that there are redundant computations inside MLLMs, and thus potential for significantly reducing inference costs without sacrificing performance. The code is available here: https://github.com/mshukor/ima-lmms.
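The skipping and parallelizing variants named in the abstract can be illustrated with a toy residual block. The following is a minimal sketch, not the authors' implementation: it assumes a standard residual formulation (x + SA(x), then h + FFN(h)), omits layer normalization, and uses hypothetical names (`run_block`, `toy_sa`, `toy_ffn`) with simple element-wise stand-ins for the real layers.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, used here for residual connections."""
    return [x + y for x, y in zip(a, b)]


def run_block(x: Vector, sa: Layer, ffn: Layer, mode: str = "full") -> Vector:
    """Apply one toy transformer block under a given computation mode.

    Assumed (hypothetical) modes:
      full       : h = x + SA(x); out = h + FFN(h)
      skip_block : out = x                      (whole block bypassed)
      skip_ffn   : out = x + SA(x)              (FFN bypassed)
      skip_sa    : out = x + FFN(x)             (self-attention bypassed)
      parallel   : out = x + SA(x) + FFN(x)     (both read the same input)
    """
    if mode == "skip_block":
        return list(x)
    if mode == "skip_ffn":
        return add(x, sa(x))
    if mode == "skip_sa":
        return add(x, ffn(x))
    if mode == "parallel":
        return add(add(x, sa(x)), ffn(x))
    if mode == "full":
        h = add(x, sa(x))
        return add(h, ffn(h))
    raise ValueError(f"unknown mode: {mode}")


# Toy stand-ins for the real layers (illustrative only).
def toy_sa(v: Vector) -> Vector:
    return [0.5 * t for t in v]


def toy_ffn(v: Vector) -> Vector:
    return [t + 1.0 for t in v]


if __name__ == "__main__":
    x = [1.0, 2.0]
    for mode in ("full", "skip_block", "skip_ffn", "skip_sa", "parallel"):
        print(mode, run_block(x, toy_sa, toy_ffn, mode))
```

In the parallel mode both sublayers read the same input, so they no longer depend on each other's output; that independence is what would allow them to run concurrently in a real model.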

Paper Structure

This paper contains 32 sections, 8 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Illustration of the proposed techniques to skip and parallelize computations in multimodal LLMs. From left to right: skipping entire blocks (Skip Block), skipping only the FFN layers (Skip FFN), skipping the self-attention layers (Skip SA), parallelizing the FFN and SA layers, and parallelizing entire blocks. These are applied every interval I of layers, starting at a specific layer (sl).
  • Figure 2: Skipping computations inside MLLMs. We skip entire blocks (Skip Block), FFN layers (Skip FFN), or SA layers (Skip SA). The skipping starts at layer 4 and happens every few layers (Layer Interval). The gray line indicates 90% of the original performance (shown in yellow).
  • Figure 3: Which tokens to skip? We compare skipping layers only for the generated textual tokens (T) with skipping them for all tokens, including the prompt (P+T).
  • Figure 4: Where to start skipping layers? Skipping early layers leads to a further decrease in scores. Starting at layer 8 (sl=8) leads to the best performance, especially when skipping many blocks.
  • Figure 5: Parallelizing computations inside MLLMs. FFN and SA layers can be run in parallel instead of sequentially without sacrificing performance. This holds less well when parallelizing 2 entire blocks.
  • ...and 2 more figures
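
The captions of Figures 1, 2, and 4 describe a skipping schedule defined by a start layer (sl) and a layer interval (I). The sketch below shows one way such a schedule could be enumerated. It is an assumption-laden illustration: the function names are hypothetical, layers are indexed from 0, the start layer itself is treated as the first skipped layer, and the 32-layer depth in the usage lines is only an example value, none of which the captions confirm.

```python
from typing import List


def skipped_layers(num_layers: int, start_layer: int, interval: int) -> List[int]:
    """Return the indices of layers affected by the schedule.

    Assumption: the technique is applied at start_layer and then at every
    `interval`-th layer after it, up to the last layer of the model.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    if start_layer < 0:
        raise ValueError("start_layer must be non-negative")
    return list(range(start_layer, num_layers, interval))


def skipped_fraction(num_layers: int, start_layer: int, interval: int) -> float:
    """Fraction of all layers that the schedule touches."""
    return len(skipped_layers(num_layers, start_layer, interval)) / num_layers


if __name__ == "__main__":
    # Example values only: a 32-layer model, as an assumed depth.
    print(skipped_layers(32, start_layer=4, interval=2))
    print(skipped_fraction(32, start_layer=4, interval=2))
    print(skipped_fraction(32, start_layer=8, interval=2))
```

Under these assumptions, a later start layer or a larger interval both reduce the share of layers affected, which is the trade-off the figures explore.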