Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaosheng Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Yao Hu, Shaohui Lin

TL;DR

Dynamic-LLaVA tackles the efficiency bottlenecks of multimodal LLM inference by jointly sparsifying the vision and language contexts across the prefill and decoding stages. It introduces two learnable predictors and mode-specific sparsification rules, trained end-to-end with MaskedSoftmax and Gumbel-Softmax to determine token retention, and supports batch-parallel sparsification. The approach reduces prefill FLOPs by approximately 75%, reduces FLOPs by about 50% when decoding without KV cache, and saves about 50% of GPU memory when decoding with KV cache, while maintaining, and in some cases improving, vision understanding and generation quality. Because it integrates with existing vision projectors and makes KV-cache decisions online, it offers a practical and scalable path toward efficient multimodal inference in real-world applications.
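
As a rough illustration of how a Gumbel-Softmax relaxation can turn per-token decision logits into a binary retention mask, here is a minimal forward-pass sketch in plain Python. It is not the authors' implementation: the function names, the two-column logit layout, and the convention that column 0 means "keep" are assumptions made for this example, and the gradient side of training (handled with a straight-through estimator in an autograd framework) is not shown.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Relax one row of decision logits with Gumbel noise and a tempered softmax."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def sample_keep_mask(decision_logits, temperature=1.0, rng=random):
    """Return (relaxed decisions, binary mask).

    Assumption for this sketch: each row is [keep_logit, drop_logit].
    """
    relaxed = [gumbel_softmax(row, temperature, rng) for row in decision_logits]
    # argmax over the two columns: 1 keeps the token, 0 drops it.
    mask = [1 if row[0] >= row[1] else 0 for row in relaxed]
    return relaxed, mask

# Hypothetical logits for three tokens: likely keep, likely drop, undecided.
rng = random.Random(0)
logits = [[4.0, -4.0], [-4.0, 4.0], [0.0, 0.0]]
relaxed, mask = sample_keep_mask(logits, temperature=1.0, rng=rng)
print(mask)
```

The noise makes the decision stochastic during training, so tokens with close logits are sometimes kept and sometimes dropped; lowering `temperature` pushes the relaxed rows closer to one-hot vectors.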

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, inference computation and memory increase progressively as output tokens are generated during decoding, directly affecting the efficiency of MLLMs. Existing methods attempt to reduce vision context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of vision context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we propose a dynamic vision-language context sparsification framework, Dynamic-LLaVA, which dynamically reduces the redundancy of the vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for each inference mode, i.e., prefill, decoding with KV cache, and decoding without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by $\sim$75\% in the prefill stage. Meanwhile, throughout the entire generation process of MLLMs, Dynamic-LLaVA reduces computation consumption by $\sim$50\% when decoding without KV cache and saves $\sim$50\% of GPU memory overhead when decoding with KV cache, owing to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible degradation in understanding and generation ability, or even performance gains, compared to the full-context inference baselines. Code is available at https://github.com/Osilly/dynamic_llava.
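
The claim that prefill-only savings diminish during decoding can be checked with simple token counting. The sketch below estimates the relative KV-cache saving as the output grows, comparing vision-only sparsification with vision-language sparsification. Every number in it is an illustrative assumption rather than a figure from the paper: a 32-layer model with hidden size 4096 in 16-bit precision, 576 image tokens, 64 prompt tokens, and hypothetical keep ratios of 0.2 for image tokens and 0.5 for output tokens.

```python
def kv_cache_bytes(num_tokens, num_layers=32, hidden_size=4096, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens of one sequence."""
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * hidden_size * bytes_per_value * num_tokens

def cached_tokens(image_tokens, prompt_tokens, output_tokens,
                  image_keep=1.0, output_keep=1.0):
    """Number of tokens whose activations stay in the cache."""
    kept_image = round(image_tokens * image_keep)
    kept_output = round(output_tokens * output_keep)
    return kept_image + prompt_tokens + kept_output

def relative_saving(output_tokens, image_keep, output_keep,
                    image_tokens=576, prompt_tokens=64):
    """Fraction of KV-cache memory saved relative to the full context."""
    full = cached_tokens(image_tokens, prompt_tokens, output_tokens)
    sparse = cached_tokens(image_tokens, prompt_tokens, output_tokens,
                           image_keep, output_keep)
    return 1.0 - kv_cache_bytes(sparse) / kv_cache_bytes(full)

for generated in (0, 1000, 4000):
    vision_only = relative_saving(generated, image_keep=0.2, output_keep=1.0)
    vision_language = relative_saving(generated, image_keep=0.2, output_keep=0.5)
    print(f"{generated:>5} output tokens: "
          f"vision-only {vision_only:.1%}, vision-language {vision_language:.1%}")
```

Under these assumptions the vision-only saving shrinks toward zero as generated tokens come to dominate the context, while also sparsifying the output tokens keeps the saving close to the output keep ratio.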

Paper Structure

This paper contains 35 sections, 13 equations, 8 figures, 17 tables.

Figures (8)

  • Figure 1: The entire generation process of MLLMs. As generation progresses, the primary resource overhead of MLLMs is GPU memory when decoding with KV cache and computation when decoding without KV cache. Previous vision context sparsification methods achieve initial inference efficiency gains, but these benefits gradually diminish as decoding continues. The results are measured on one A100 (80G) with the batch size fixed to 8. "OOM" means the generation process failed because it ran out of GPU memory.
  • Figure 2: The sparsification inference modes for MLLMs. In the prefill stage, only image tokens are dropped, based on the decisions of the learnable image predictor. For decoding without KV cache, we reduce both the image tokens and the output text tokens to maintain consistent inference efficiency. When decoding with KV cache, the output predictor determines whether the activations generated by the current output text token should be added to the KV cache, thereby discarding part of the activations to reduce the size of the KV cache. Note that the decision regarding the activations of the current output text token is shared across all subsequent layers beyond the $l$-th layer. The "Yes" branch denotes the decision to keep the token or its activations so that they participate in subsequent calculations.
  • Figure 3: The detailed training pipeline of Dynamic-LLaVA. Top figure: the mask for the MaskedSoftmax operation during training. We use the predictors to generate the binary mask $\mathcal{M}$ and subsequently form a binary mask matrix $\mathbb{G}$. This mask matrix is employed in the MaskedSoftmax operation within the Multi-Head Attention Block to isolate the influence of non-essential tokens on essential tokens during training. Bottom figure: the pipeline of the predictors during training. In forward propagation, we use the GumbelSoftmax function to relax the decision matrices $D^{I}$ and $D^{OT}$ to obtain $D^{I\dag}$ and $D^{OT\dag}$, respectively. Then, we use the argmax operation to generate the binary mask $\mathcal{M}$ for the token set. During back propagation, we use the straight-through estimator (STE; Bengio et al., 2013) to directly estimate the gradients of $D^{I}$ and $D^{OT}$ through the binary mask $\mathcal{M}$, bypassing the non-differentiable argmax operation to avoid the gradient flow problem.
  • Figure 4: KV cache compression pipeline: H2O (Zhang et al., 2023) vs. Dynamic-LLaVA (when decoding with KV cache). Left figure: the KV cache compression pipeline of H2O calculates the attention scores between the current $Q$ and the past KV cache during the decoding stage. The KV activations corresponding to the minimal attention score are subsequently dropped from the historical KV cache. Right figure: the workflow of Dynamic-LLaVA when decoding with KV cache. Our approach uses an output predictor to evaluate each current token's features and determine whether its activations, computed through $W_K$ and $W_V$, should be added to the KV cache.
  • Figure 5: Overviews of the image predictor (a) and the output text predictor (b).
  • ...and 3 more figures
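
The captions of Figures 2 and 4 describe an online decision when decoding with KV cache: a predictor looks at the current token's features and decides whether its key and value activations enter the cache. The sketch below illustrates that control flow only. The class name, the `keep_token` callable standing in for the output predictor, the toy threshold rule, and the choice to let the current token always attend to its own key and value are assumptions for this example, not details confirmed by the text above.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Entry = Tuple[Vector, Vector]  # (key, value) for one token

class GatedKVCache:
    """KV cache that admits a new token's key/value only if a predictor says so."""

    def __init__(self, keep_token: Callable[[Vector], bool]):
        # Hypothetical stand-in for a learned output predictor.
        self.keep_token = keep_token
        self.entries: List[Entry] = []

    def step(self, features: Vector, key: Vector, value: Vector) -> List[Entry]:
        """Return the context for the current token, then update the cache."""
        # Assumption: the current token attends to the cache plus itself.
        context = self.entries + [(key, value)]
        # The decision uses only the current token's features; no attention
        # scores over the cached history are needed.
        if self.keep_token(features):
            self.entries.append((key, value))
        return context

# Toy predictor (hypothetical): keep tokens whose first feature exceeds 0.5.
cache = GatedKVCache(lambda features: features[0] > 0.5)
context = []
for x in [0.1, 0.9, 0.4, 0.8, 0.7, 0.2]:
    context = cache.step(features=[x], key=[x], value=[x])

print(len(cache.entries), "of 6 tokens cached")
```

Because a rejected token never enters the cache, memory grows only with the tokens the predictor keeps, and no eviction pass over past entries is required.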