Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

Minsu Kim, Seongmin Hong, RyeoWook Ko, Soongyu Choi, Hunjong Lee, Junsoo Kim, Joo-Young Kim, Jongse Park

TL;DR

The paper tackles memory bandwidth and capacity bottlenecks in batched LLM serving caused by expanding KV caches. It proposes Oaken, a co-design of offline-threshold-based KV cache quantization and specialized hardware that performs per-token quantization, group-shift compression, and fused dense-and-sparse encoding, integrated into an LLM accelerator with token-level batching. Key technical contributions include offline threshold profiling for $T_{lo}^o$, $T_{lo}^i$, $T_{hi}^i$, $T_{hi}^o$, a three-group quantization scheme (middle group at 4 bits, outer/inner at 5 bits), and a dense-sparse fusion that reduces outlier storage to 8 bits per entry, all orchestrated by a memory-management-aware DMA pipeline. Experiments show up to 1.58x throughput gains over NVIDIA A100 with a modest average accuracy loss around 0.54% and 8.21% area overhead, demonstrating practical viability for scalable, cost-effective LLM serving across various models and long sequences.
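The scheme summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several things here are assumptions made for illustration: the quantile-based profiling, the group fractions (`outer_frac`, `inner_frac`), the plain min-max quantizer, and all function names. The sketch also assumes the inner group holds values between $T_{lo}^i$ and $T_{hi}^i$ and the outer group holds values beyond $T_{lo}^o$ or $T_{hi}^o$. Group-shift compression and the fused dense-and-sparse encoding are not modeled.

```python
import numpy as np

def profile_thresholds(samples, outer_frac=0.04, inner_frac=0.06):
    """Offline step: derive four thresholds from sample KV values.
    The quantile rule and the fractions are hypothetical choices."""
    lo_o = np.quantile(samples, outer_frac / 2)
    hi_o = np.quantile(samples, 1 - outer_frac / 2)
    lo_i = np.quantile(samples, 0.5 - inner_frac / 2)
    hi_i = np.quantile(samples, 0.5 + inner_frac / 2)
    return lo_o, lo_i, hi_i, hi_o

def group_ids(token, thr):
    """Online step: a threshold comparison assigns each value to a group
    (0 = inner, 1 = middle, 2 = outer); no sorting or top-k is needed."""
    lo_o, lo_i, hi_i, hi_o = thr
    outer = (token < lo_o) | (token > hi_o)
    inner = (token > lo_i) & (token < hi_i)
    return np.where(outer, 2, np.where(inner, 0, 1))

def quantize_group(x, bits):
    """Uniform min-max quantization; the scale is computed online per token."""
    if x.size == 0:
        return x.astype(np.uint8), 0.0, 1.0
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1) or 1.0  # avoid a zero scale
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def quantize_token(token, thr, bits=(5, 4, 5)):
    """Quantize one token's KV vector group by group and dequantize it.
    bits = (inner, middle, outer), following the bitwidths in the TL;DR."""
    ids = group_ids(token, thr)
    recon = np.empty_like(token)
    for g, b in enumerate(bits):
        mask = ids == g
        q, lo, scale = quantize_group(token[mask], b)
        recon[mask] = q * scale + lo
    return ids, recon

rng = np.random.default_rng(0)
calib = rng.standard_normal(10_000)   # stand-in for profiled KV values
thr = profile_thresholds(calib)
token = rng.standard_normal(128)      # stand-in for one token's KV vector
ids, recon = quantize_token(token, thr)
max_err = float(np.max(np.abs(recon - token)))
```

The point of the split is that the expensive part (choosing thresholds) happens once offline, while the online path only needs comparisons and a per-group minimum and maximum.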

Abstract

Modern Large Language Model (LLM) serving systems batch multiple requests to achieve high throughput, but batching attention operations is challenging, which makes memory bandwidth a critical bottleneck. The community relies on high-end GPUs with multiple high-bandwidth memory (HBM) channels. Unfortunately, HBM's high bandwidth often comes at the expense of limited memory capacity, which reduces core utilization and increases costs. Recent advancements enabling longer contexts for LLMs have substantially increased the key-value (KV) cache size, further intensifying the pressure on memory capacity. The literature has explored KV cache quantization techniques, which commonly use a low bitwidth for most values and selectively use a higher bitwidth for outlier values. While this approach helps achieve high accuracy and low bitwidth simultaneously, the cost of online outlier detection is excessively high, negating the advantages. We propose Oaken, an acceleration solution that achieves high accuracy and high performance simultaneously by co-designing algorithm and hardware. To find a sweet spot in the accuracy-performance trade-off space of KV cache quantization, Oaken employs an online-offline hybrid approach: outlier thresholds are set offline and are then used to determine the quantization scale online. To translate the proposed algorithmic technique into tangible performance gains, Oaken also comes with custom quantization engines and memory management units that can be integrated with any LLM accelerator. We built an Oaken accelerator on top of an LLM accelerator, LPU, and conducted a comprehensive evaluation. Our experiments show that for a batch size of 256, Oaken achieves up to 1.58x throughput improvement over an NVIDIA A100 GPU, incurring a minimal accuracy loss of only 0.54% on average, compared to state-of-the-art KV cache quantization techniques.
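To make the capacity pressure concrete, the following back-of-the-envelope sketch estimates the footprint of an unquantized KV cache. The Llama2-13B shape (40 layers, hidden size 5120, standard multi-head attention), FP16 storage, and the 2048-token sequence length are assumptions chosen for illustration, not figures taken from the paper; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(batch, seq_len, n_layers=40, hidden=5120, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and token.
    The leading factor 2 accounts for storing both K and V."""
    return 2 * n_layers * hidden * bytes_per_elem * batch * seq_len

GIB = 2**30
per_token = kv_cache_bytes(1, 1)          # bytes per token per request
for batch in (1, 16, 256):
    size_gib = kv_cache_bytes(batch, 2048) / GIB
    print(f"batch={batch:3d}  KV cache = {size_gib:.2f} GiB")
```

Under these assumptions the cache grows linearly with both batch size and sequence length while the model weights stay fixed, so at large batch sizes the KV cache, not the parameters, dominates memory.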

Paper Structure

This paper contains 24 sections, 4 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Existing solutions for LLM inference serving systems plotted on the bandwidth-capacity trade-off space. The "effective bandwidth" and "effective capacity" represent the scale of data that can be transmitted to/from and stored on memory, respectively. We also delineate their corresponding throughput (i.e., tokens/sec) using the colors presented on a 1D heatmap on the right side.
  • Figure 2: (a) Structure of LLM inference and decoder layer during the prefill and generation phases. (b) Operations in the multi-head attention layer, including activation-weight and activation-activation operations, during the generation phase of batched inference for three requests.
  • Figure 3: Characteristic analysis of LLM inference for (a) single request and (b) batched multiple requests. (c) Utilization measurement during the generation phase with batched multiple requests using NVIDIA A100 GPU.
  • Figure 4: Throughput of accelerators equipped with HBM and LPDDR memory when using (a) Llama2-13B and (b) OPT-30B (OOM refers to "Out-of-Memory."). (c) Accelerator specification with HBM and LPDDR memory.
  • Figure 5: (a) Memory usage breakdown into KV cache and model parameters for the Llama2-13B model as the batch size sweeps from 1 to 256. (b) Throughput comparison among no quantization, weight quantization, and KV cache quantization for Llama2-13B inference. The experiment is conducted with LPDDR-NPU.
  • ...and 9 more figures