CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs
Insu Han, Zeliang Zhang, Zhiyuan Wang, Yifan Zhu, Susan Liang, Jiani Liu, Haiting Lin, Mingjie Zhao, Chenliang Xu, Kun Wan, Wentian Zhao
TL;DR
CalibQuant tackles the memory bottleneck of KV caches in multimodal LLMs by introducing a plug-and-play 1-bit quantization scheme for visual KV caches. The method combines channel-wise quantization, a post-quantization calibration of pre-softmax scores, and a post-scaling technique that reduces dequantization cost, all implemented with Triton kernels for high throughput. Across image captioning, document VQA, and video understanding tasks, CalibQuant achieves substantial memory savings and up to roughly 11x higher decoding throughput with minimal accuracy loss. These results point to practical acceleration for deployed MLLMs on memory-constrained hardware, enabling longer or more interactive multimodal inference. Ablations support the importance of calibration and of channel-wise treatment of the value cache.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance across diverse applications. However, their computational overhead during deployment remains a critical bottleneck. While Key-Value (KV) caching effectively trades memory for computation to enhance inference efficiency, the growing memory footprint of large KV caches significantly reduces throughput and restricts prolonged deployment on memory-constrained GPU devices. To address this challenge, we propose CalibQuant, a simple yet highly effective visual quantization strategy that drastically reduces both memory and computational overhead. Specifically, CalibQuant introduces an extreme 1-bit quantization scheme, complemented by novel post-scaling and calibration techniques tailored to the intrinsic patterns of KV caches, thereby ensuring high efficiency without compromising model performance. Leveraging Triton for runtime optimization, we achieve a 10x throughput increase on InternVL models. Our method is designed to be plug-and-play, integrating with various existing MLLMs without requiring architectural changes. Extensive experiments confirm that our approach significantly reduces memory usage while maintaining computational efficiency and preserving multimodal capabilities. Code is available at https://github.com/insuhan/calibquant.
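The abstract names channel-wise 1-bit quantization and post-scaling without spelling out the arithmetic. As a rough illustration only, the sketch below assumes a simple min/max scheme in which each channel of a key cache is stored as one bit per token plus a per-channel offset and scale. It then assumes that post-scaling means folding that scale into the query, so the attention scores come out the same without materializing the dequantized keys. All function names are hypothetical. The calibration step and the Triton kernels are not modeled, and the actual CalibQuant implementation may differ.

```python
import random

def quantize_1bit_channelwise(rows):
    """Quantize a (tokens x channels) matrix to 1 bit per entry.
    Assumed scheme: each channel keeps its own min (lo) and range (scale)."""
    channels = list(zip(*rows))
    lo = [min(c) for c in channels]
    scale = [max(c) - min(c) for c in channels]
    # A value maps to 1 if it lies above the channel midpoint, else 0.
    bits = [[1 if v > l + s / 2 else 0 for v, l, s in zip(row, lo, scale)]
            for row in rows]
    return bits, lo, scale

def dequantize(bits, lo, scale):
    """Reconstruct values: bit 0 -> channel min, bit 1 -> channel max."""
    return [[b * s + l for b, l, s in zip(row, lo, scale)] for row in bits]

def scores_dequantize_first(query, bits, lo, scale):
    """Baseline: rebuild every key, then take dot products with the query."""
    keys = dequantize(bits, lo, scale)
    return [sum(q * k for q, k in zip(query, row)) for row in keys]

def scores_post_scaled(query, bits, lo, scale):
    """Assumed post-scaling: scale the query once, then sum over set bits.
    Uses q . (b * s + lo) = (q * s) . b + q . lo, so the result is identical."""
    scaled_q = [q * s for q, s in zip(query, scale)]
    offset = sum(q * l for q, l in zip(query, lo))
    return [sum(sq for sq, b in zip(scaled_q, row) if b) + offset
            for row in bits]

random.seed(0)
keys = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # 6 tokens, 4 channels
query = [random.gauss(0, 1) for _ in range(4)]

bits, lo, scale = quantize_1bit_channelwise(keys)
naive = scores_dequantize_first(query, bits, lo, scale)
fast = scores_post_scaled(query, bits, lo, scale)
```

Under these assumptions the two score vectors agree up to floating-point rounding. The post-scaled path does per-channel work once per query instead of once per cached token, which is the kind of saving the abstract attributes to post-scaling.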
