Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference

Wei Tao, Bin Zhang, Xiaoyang Qu, Jiguang Wan, Jianzong Wang

TL;DR

Cocktail addresses the memory and latency bottlenecks of long-context LLM inference by introducing chunk-level adaptive mixed-precision quantization for the KV cache. It combines a retrieval-inspired chunk-level quantization search, which assigns FP16/INT4/INT2 bitwidths per context chunk based on cosine similarity to the query, with a hardware-aware chunk-level KV cache computation that reorders chunks to keep identical bitwidths contiguous during inference. Empirical results across multiple models and long-context datasets show that Cocktail improves accuracy compared to SOTA KV-quantization methods, while reducing GPU memory usage by 12–42% and per-token latency (TPOT) by 32–52%, and achieving higher throughput at larger batch sizes. The approach offers practical gains for scalable long-context LLM deployment on commodity GPUs by balancing precision, computation, and memory through chunk-level strategies.
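The chunk-level quantization search described above can be illustrated with a minimal sketch. It assumes that each context chunk and the query have already been embedded as plain float vectors, that more relevant chunks keep higher precision, and that two fixed similarity thresholds separate the three bitwidths. The function names and the threshold values 0.8 and 0.4 are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_bitwidths(
    chunk_embeddings: Sequence[Sequence[float]],
    query_embedding: Sequence[float],
    hi_threshold: float = 0.8,  # hypothetical value, not from the paper
    lo_threshold: float = 0.4,  # hypothetical value, not from the paper
) -> List[str]:
    """Pick a precision label per chunk from its similarity to the query.

    Assumption: chunks more similar to the query keep higher precision.
    """
    labels = []
    for embedding in chunk_embeddings:
        score = cosine_similarity(embedding, query_embedding)
        if score >= hi_threshold:
            labels.append("FP16")
        elif score >= lo_threshold:
            labels.append("INT4")
        else:
            labels.append("INT2")
    return labels


if __name__ == "__main__":
    query = [1.0, 0.0]
    chunks = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    print(assign_bitwidths(chunks, query))
```

Because the decision is made once per chunk rather than once per token, the search cost in this sketch grows with the number of chunks, which is the efficiency argument the summary makes for chunk granularity.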

Abstract

Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerable inference latency and GPU memory usage. Existing methods apply mixed-precision quantization to the key-value (KV) cache in LLMs at token granularity, which makes the search process time-consuming and the computation hardware-inefficient. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search quickly determines the optimal bitwidth configuration of the KV cache chunks based on the similarity scores between the corresponding context chunks and the query, while maintaining model accuracy. Furthermore, chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets.
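The reordering step in the chunk-level KV cache computation can be sketched as a stable grouping of chunk indices by bitwidth, so that each precision occupies one contiguous segment. This is an illustration under assumptions rather than the authors' implementation: the function name, the FP16-first ordering, and the (label, start, end) segment format are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical ordering of precision groups; any fixed order keeps groups contiguous.
PRECISION_ORDER = {"FP16": 0, "INT4": 1, "INT2": 2}


def reorder_chunks(bitwidths: List[str]) -> Tuple[List[int], List[Tuple[str, int, int]]]:
    """Group chunk indices by bitwidth.

    Returns:
        order: original chunk indices in their new storage order.
        segments: (bitwidth, start, end) ranges over `order`, end exclusive.
    """
    # sorted() is stable, so chunks of the same bitwidth keep their relative order.
    order = sorted(range(len(bitwidths)), key=lambda i: PRECISION_ORDER[bitwidths[i]])

    segments = []
    start = 0
    for pos in range(1, len(order) + 1):
        # Close the current segment at the end of the list or when the bitwidth changes.
        if pos == len(order) or bitwidths[order[pos]] != bitwidths[order[start]]:
            segments.append((bitwidths[order[start]], start, pos))
            start = pos
    return order, segments


if __name__ == "__main__":
    order, segments = reorder_chunks(["INT2", "FP16", "INT4", "INT2", "FP16"])
    print(order)
    print(segments)
```

With the groups contiguous, each segment could be quantized and processed as a single uniform-precision block instead of switching precision from one chunk to the next.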

Paper Structure

This paper contains 16 sections, 5 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: The similarity heatmap between a long passage and 10 different queries. Most of the passage chunks are irrelevant to the query.
  • Figure 2: The architecture overview of Cocktail. (a) The chunk-level quantization search module. (b) The chunk-level KV cache computation module.
  • Figure 3: The process of KV cache chunk reordering.
  • Figure 4: GPU memory of different models.
  • Figure 5: Time per output token (TPOT) of different models.
  • ...and 2 more figures