CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs

Insu Han, Zeliang Zhang, Zhiyuan Wang, Yifan Zhu, Susan Liang, Jiani Liu, Haiting Lin, Mingjie Zhao, Chenliang Xu, Kun Wan, Wentian Zhao

TL;DR

CalibQuant tackles the memory bottleneck of KV caches in multimodal LLMs by introducing a plug-and-play 1-bit quantization for visual KV caches. The method combines channel-wise quantization, a post-quantization calibration of pre-softmax scores, and a post-scaling technique that reduces dequantization cost, all implemented with Triton kernels for high throughput. Across image captioning, document VQA, and video understanding tasks, CalibQuant achieves substantial memory savings and up to roughly 11x decoding throughput improvements with minimal accuracy loss. These results demonstrate practical, scalable acceleration for deployed MLLMs on memory-constrained hardware, enabling longer or more interactive multimodal inference. The work provides a principled approach to visual-token quantization, with extensive ablations validating the importance of calibration and of channel-wise treatment of the value cache.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance across diverse applications. However, their computational overhead during deployment remains a critical bottleneck. While Key-Value (KV) caching effectively trades memory for computation to enhance inference efficiency, the growing memory footprint from extensive KV caches significantly reduces throughput and restricts prolonged deployment on memory-constrained GPU devices. To address this challenge, we propose CalibQuant, a simple yet highly effective visual quantization strategy that drastically reduces both memory and computational overhead. Specifically, CalibQuant introduces an extreme 1-bit quantization scheme, complemented by novel post-scaling and calibration techniques tailored to the intrinsic patterns of KV caches, thereby ensuring high efficiency without compromising model performance. Leveraging Triton for runtime optimization, we achieve a 10x throughput increase on InternVL models. Our method is designed to be plug-and-play, seamlessly integrating with various existing MLLMs without requiring architectural changes. Extensive experiments confirm that our approach significantly reduces memory usage while maintaining computational efficiency and preserving multimodal capabilities. Code is available at https://github.com/insuhan/calibquant.
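
To make the channel-wise quantization and post-scaling ideas concrete, here is a minimal NumPy sketch. It is not the authors' Triton implementation: the function name `quantize_channelwise`, the toy shapes, and the random stand-in data are all assumptions made for illustration, and the calibration step is not covered. The sketch assumes uniform min-max quantization with one scale and offset per channel, and reads "post-scaling" as applying that scale and offset after the matrix product rather than dequantizing the whole cache first.

```python
import numpy as np

def quantize_channelwise(x, bits=1):
    """Uniform min-max quantization with one (scale, offset) pair per channel (column)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=0, keepdims=True)            # per-channel minimum, shape (1, d)
    hi = x.max(axis=0, keepdims=True)            # per-channel maximum, shape (1, d)
    scale = np.maximum(hi - lo, 1e-8) / levels   # guard against constant channels
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

rng = np.random.default_rng(0)
n, d = 16, 8                       # hypothetical: 16 visual tokens, head dimension 8
V = rng.normal(size=(n, d))        # stand-in for a value cache
p = rng.random(n)
p /= p.sum()                       # stand-in for attention weights (sum to 1)

codes, scale, lo = quantize_channelwise(V, bits=1)

# Naive path: dequantize the full (n, d) cache, then multiply.
out_naive = p @ (codes * scale + lo)

# Post-scaling path: multiply by the integer codes first, then apply the
# per-channel scale and offset once to the (d,) result.
out_post = (p @ codes) * scale[0] + p.sum() * lo[0]

print(np.allclose(out_naive, out_post))
```

Both paths compute the same output algebraically; the second touches only `d` scale and offset values after the product instead of materializing `n * d` dequantized entries, which is the kind of saving a post-scaling step targets.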

Paper Structure

This paper contains 23 sections, 10 equations, 3 figures, 8 tables, 2 algorithms.

Figures (3)

  • Figure 1: Distribution of entries in $q K^\top/\sqrt{d}$ without quantization (Exact, green), with quantization (Quant, blue), and with post-quantization calibration (Quant-C, red) across different layers and heads.
  • Figure 2: Mean squared error (MSE) for $\mathrm{softmax}(q K^\top/\sqrt{d})$ across multiple layers. Quantization with calibration (Quant-C, red) shows much lower error than the quantization-only method (Quant, blue).
  • Figure 3: Throughput of our 2-bit and 1-bit quantization and the baseline (16-bit) across various memory budgets (5 to 30 GB). We use two models, $\mathtt{internvl2\_5}$-$\mathtt{8b}$ and $\mathtt{internvl2\_5}$-$\mathtt{26b}$, with visual token lengths $n=3328$ and $8192$. The annotated text indicates the maximum batch size accommodated within each memory budget.
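
The link between bit width and the batch size that fits in a memory budget (Figure 3) can be illustrated with back-of-the-envelope arithmetic. The sketch below is an assumption-laden estimate, not a reproduction of the paper's measurements: the helper `kv_cache_bytes` and the model dimensions in `cfg` are hypothetical placeholders rather than the actual InternVL configurations, and the per-channel scale and offset metadata that quantization adds is ignored.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bits):
    """Approximate KV cache size in bytes for one sequence (keys plus values)."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # 2 = keys and values
    return elems * bits / 8

# Hypothetical model dimensions, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n in (3328, 8192):                 # visual token lengths from the Figure 3 caption
    for bits in (16, 2, 1):
        mib = kv_cache_bytes(n, bits=bits, **cfg) / 2**20
        print(f"n={n:5d}  {bits:2d}-bit  {mib:8.1f} MiB per sequence")
```

Under these assumptions the 1-bit cache is 16 times smaller than the 16-bit one, so roughly 16 times as many sequences fit in the cache portion of a fixed budget, before accounting for model weights, activations, and quantization metadata.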