MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference

Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang

TL;DR

MEDA tackles the memory and latency challenges of multimodal long-context inference by dynamically allocating KV caches across Transformer layers, guided by cross-modal attention entropy. It combines a per-layer budgeting mechanism with a KV pair selection and average-merge strategy to preserve critical multimodal information under compression, without requiring fine-tuning. Across MileBench and long-video benchmarks, MEDA achieves up to 72% KV cache memory reduction and up to 2.82x faster decoding while maintaining or improving performance. The approach offers a practical, plug-and-play solution for efficient multimodal long-context reasoning across diverse models and tasks.

Abstract

Long-context Multimodal Large Language Models (MLLMs) that incorporate long text-image and text-video modalities demand substantial resources as their multimodal Key-Value (KV) caches grow with increasing input lengths, which challenges inference efficiency. Existing methods for KV cache compression, in both text-only and multimodal LLMs, have neglected attention density variations across layers and thus often adopt uniform or progressive reduction strategies for layer-wise cache allocation. In this work, we propose MEDA, a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. At its core, MEDA utilizes cross-modal attention entropy to determine the KV cache size at each MLLM layer. Given the dynamically allocated KV cache size at each layer, MEDA also employs a KV pair selection scheme to identify which KV pairs to select and a KV pair merging strategy that merges the selected and non-selected ones to preserve information from the entire context. MEDA achieves up to 72% KV cache memory reduction and 2.82 times faster decoding speed, while maintaining or enhancing performance on various multimodal tasks in long-context settings, including multi-image and long-video scenarios. Our code is released at https://github.com/AIoT-MLSys-Lab/MEDA.
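
To make the idea of entropy-guided layer-wise allocation concrete, the following is a minimal sketch and not the released MEDA implementation. It assumes that each layer's score is the mean Shannon entropy of its attention rows and that a global budget of `rho * seq_len * num_layers` entries is split across layers in proportion to that entropy, so that layers with denser attention keep more KV pairs. The function names `attention_entropy` and `allocate_budgets`, and the proportional rule itself, are illustrative assumptions; the paper's exact normalization is not reproduced here.

```python
import math


def attention_entropy(attn_rows):
    """Mean Shannon entropy (in nats) over attention rows.

    Each row is assumed to be a probability distribution (sums to 1).
    """
    total = 0.0
    for row in attn_rows:
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(attn_rows)


def allocate_budgets(layer_entropies, seq_len, rho):
    """Split a global KV budget across layers in proportion to entropy.

    Assumption (hypothetical rule): higher entropy -> larger cache.
    Budgets are clamped to [1, seq_len], so after clamping or rounding
    the total may differ slightly from rho * seq_len * num_layers.
    """
    num_layers = len(layer_entropies)
    total_budget = rho * seq_len * num_layers
    entropy_sum = sum(layer_entropies)
    budgets = []
    for h in layer_entropies:
        # Fall back to a uniform split if every layer has zero entropy.
        share = h / entropy_sum if entropy_sum > 0 else 1.0 / num_layers
        budgets.append(max(1, min(seq_len, round(total_budget * share))))
    return budgets
```

For example, with three layers whose entropies are 2.0, 1.0, and 1.0, a sequence of 100 tokens, and `rho = 0.2`, this rule gives the first layer half of the 60-entry global budget and each of the other two a quarter.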

Paper Structure

This paper contains 29 sections, 17 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: A multimodal long-context sample from Video-ChatGPT (Maaz et al., 2023), showing key information interactions between blue-boxed video frames and textual phrases.
  • Figure 2: Using the cross-modal attention entropy, we analyze LLaVA-NeXT-7B (Liu et al., 2024) across different sub-tasks (Song et al., 2024). We observe varying multimodal interaction patterns: early layers (e.g., Layer 1) exhibit dense attention weights with higher entropy, while deeper layers (e.g., Layer 24) exhibit sparse attention weights with lower entropy, given that they focus on key tokens (red columns), similar to the blue areas and text in Figure 1.
  • Figure 3: Illustration of MEDA's multimodal attention entropy-guided dynamic KV cache allocation and merging strategy.
  • Figure 4: Impact of compression ratio $\rho$.
  • Figure 5: An example of video content understanding and question answering based on Video-ChatGPT using the LongVA model and the KV cache compression technique of MEDA.
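
The selection and merging step illustrated in Figure 3 can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: it assumes that KV pairs are ranked by a per-token importance score (for instance, accumulated attention), that the top `budget` pairs are kept, and that each non-selected pair is averaged into the kept pair whose key is most similar by cosine similarity. The helper names (`cosine`, `mean_vec`, `select_and_merge`) and the similarity-based assignment are hypothetical; the source only states that selected and non-selected pairs are merged by averaging.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_and_merge(keys, values, scores, budget):
    """Keep the `budget` highest-scoring KV pairs and average the rest in.

    Assumes budget >= 1. Returns (kept_indices, merged_keys, merged_values),
    with kept indices in their original sequence order.
    """
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:budget])
    dropped = sorted(order[budget:])

    # Each kept index starts a group that contains only itself.
    groups = {i: [i] for i in kept}
    for j in dropped:
        # Assumed rule: assign a dropped pair to its most similar kept key.
        target = max(kept, key=lambda i: cosine(keys[i], keys[j]))
        groups[target].append(j)

    merged_keys = [mean_vec([keys[t] for t in groups[i]]) for i in kept]
    merged_values = [mean_vec([values[t] for t in groups[i]]) for i in kept]
    return kept, merged_keys, merged_values
```

Averaging dropped pairs into their nearest kept neighbor keeps the cache at exactly `budget` entries per layer while still letting evicted tokens contribute, which is the motivation the abstract gives for merging rather than discarding.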