VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, Panpan Xu

TL;DR

VL-Cache targets the KV cache bottleneck in vision-language model inference with two components: a sparsity-aware, layer-specific cache budget allocation guided by post-vision attention, and a modality-aware token scoring policy. By tailoring cache retention to per-prompt sparsity patterns and emphasizing post-vision token importance, it achieves near full-cache accuracy with as little as 10% of the KV cache and delivers substantial speedups, especially for long outputs. Experiments on COCO-Caption, DocVQA, and MathVista with LLaVA backbones show accuracy comparable to the full cache and higher than baseline compression methods under limited budgets, with decoding speedups of up to 7.08x and a 90% reduction in KV cache memory usage. The approach is relevant to real-time VLM deployments that need a smaller memory footprint and lower latency on long-context multimodal inputs.
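To put the 90% memory reduction in concrete terms, the following back-of-the-envelope sketch estimates KV cache size for a decoder-only backbone. The model dimensions (32 layers, 32 KV heads, head dimension 128, fp16 storage) are illustrative assumptions typical of a 7B-parameter LLaMA-style model, not values reported in the paper, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    All default dimensions are assumptions for illustration only.
    """
    # The factor 2 covers one key vector plus one value vector per token,
    # per head, per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


prompt_len = 2048                       # a 2K-token multimodal prompt
full = kv_cache_bytes(prompt_len)       # full cache
kept = kv_cache_bytes(int(prompt_len * 0.10))  # 10% retention budget

print(f"full: {full / 2**20:.0f} MiB, 10% budget: {kept / 2**20:.1f} MiB")
```

Under these assumed dimensions a single 2K-token request needs about 1 GiB of cache, so a 10% budget frees roughly 0.9 GiB per request, which is memory that could otherwise hold a larger batch.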

Abstract

Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks. A key challenge in accelerating VLMs is storing and accessing the large Key-Value (KV) cache that encodes long visual contexts, such as images or videos. While existing KV cache compression methods are effective for Large Language Models (LLMs), directly migrating them to VLMs yields suboptimal accuracy and speedup. To bridge the gap, we propose VL-Cache, a novel KV cache compression recipe tailored for accelerating VLM inference. In this paper, we first investigate the unique sparsity pattern of VLM attention by distinguishing visual and text tokens in the prefill and decoding phases. Based on these observations, we introduce a layer-adaptive, sparsity-aware cache budget allocation method that effectively distributes the limited cache budget across different layers, further reducing KV cache size without compromising accuracy. Additionally, we develop a modality-aware token scoring policy to better evaluate token importance. Empirical results on multiple benchmark datasets demonstrate that retaining only 10% of the KV cache achieves accuracy comparable to that with the full cache. In a speed benchmark, our method speeds up end-to-end generation of 100 tokens by up to 2.33x and decoding by up to 7.08x, while reducing the GPU memory footprint of the KV cache by 90%.
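The layer-adaptive allocation can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption: sparsity is measured as the fraction of attention weights below a relative threshold of each row's maximum, and the budget is split across layers in proportion to density (1 - sparsity). The function names and the threshold value are hypothetical, and causal masking is ignored for simplicity.

```python
def layer_sparsity(attn, rel_threshold=0.01):
    """Fraction of attention weights below rel_threshold * (row maximum).

    attn is a list of rows; each row is one query's attention distribution
    over the cached keys. The thresholding rule is an assumption.
    """
    total = 0
    small = 0
    for row in attn:
        cutoff = rel_threshold * max(row)
        for w in row:
            total += 1
            small += w < cutoff  # True counts as 1
    return small / total


def allocate_budgets(sparsities, avg_budget):
    """Split the overall budget across layers in proportion to density.

    avg_budget is the target average retention ratio (e.g. 0.10).
    Denser layers (lower sparsity) receive a larger share.
    """
    n = len(sparsities)
    densities = [1.0 - s for s in sparsities]
    total = sum(densities)
    if total == 0:
        return [avg_budget] * n  # degenerate case: fall back to uniform
    # Clipping at 1.0 keeps each ratio valid; if clipping triggers,
    # the average falls slightly below avg_budget.
    return [min(1.0, avg_budget * n * d / total) for d in densities]


# Three layers with different (made-up) sparsity levels, 10% average budget.
budgets = allocate_budgets([0.9, 0.5, 0.8], avg_budget=0.10)
print([round(b, 4) for b in budgets])
```

In this toy setup the densest layer receives the largest share while the average retention across layers stays at 10%, in contrast to a uniform policy that would give every layer the same 10%.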

Paper Structure

This paper contains 24 sections, 5 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Layer-wise attention sparsity in prefill and decoding phases. Different layers exhibit varying degrees of sparsity; the layer-wise sparsity trend in the decoding phase is similar to that in the prefill phase.
  • Figure 2: Average Cache Hit Rate. The accumulated post-vision attention (ours) demonstrates a higher cache hit rate compared to the other two token scoring policies.
  • Figure 3: VL-Cache Overview. In the prefill stage, the cache budget for each layer is dynamically allocated according to the layer-wise sparsity. Then Post-vision Attention is employed to select both critical visual and language tokens.
  • Figure 4: Evaluation results on different datasets with varied cache budgets. The evaluation metrics are the average of sampled tasks. VL-Cache achieves comparable accuracy against full-cache and outperforms multiple baselines with limited KV cache budget.
  • Figure 5: Server-level throughput vs. request-level latency curve (prompt length = 2K). Labeled points indicate batch size.
  • ...and 3 more figures
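
The token selection step named in the Figure 2 and Figure 3 captions can be sketched as follows. This is one plausible reading, not the paper's implementation: each cached token is scored by the attention it receives from the query tokens that come after the visual tokens, and the top-scoring tokens are kept. The function names, the toy attention matrix, and the budget value are all hypothetical.

```python
def post_vision_scores(attn, first_post_vision):
    """Accumulate the attention each key receives from post-vision queries.

    attn[i][j] is the weight query i places on key j. Only rows at or after
    first_post_vision (the first token following the image) contribute.
    """
    scores = [0.0] * len(attn[0])
    for row in attn[first_post_vision:]:
        for j, w in enumerate(row):
            scores[j] += w
    return scores


def select_tokens(scores, budget):
    """Return the sorted indices of the top-scoring tokens to keep."""
    k = max(1, int(budget * len(scores)))
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


# Toy causal attention: tokens 0-3 are visual, tokens 4-5 are post-vision text.
attn = [
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00, 0.00, 0.00],
    [0.40, 0.30, 0.30, 0.00, 0.00, 0.00],
    [0.70, 0.10, 0.10, 0.10, 0.00, 0.00],
    [0.05, 0.60, 0.05, 0.10, 0.20, 0.00],
    [0.05, 0.50, 0.05, 0.10, 0.10, 0.20],
]

keep_post_vision = select_tokens(post_vision_scores(attn, 4), budget=0.34)
keep_all_queries = select_tokens(post_vision_scores(attn, 0), budget=0.34)
print(keep_post_vision, keep_all_queries)
```

In this toy matrix, accumulating over all queries favors the first visual token because earlier visual queries attend to it heavily, whereas restricting to post-vision queries retains a text token that the prompt itself attends to. Both visual and language tokens compete for the same budget, matching the caption's description.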