Titanus: Enabling KV Cache Pruning and Quantization On-the-Fly for LLM Acceleration

Peilin Chen, Xiaoxuan Yang

TL;DR

Titanus addresses the KV cache bottleneck in LLM inference by pruning and quantizing the KV cache on-the-fly through a software-hardware co-design built on computing-in-memory (CIM). Its key contributions are the cascade pruning-quantization (CPQ) method, the hierarchical quantization extension (HQE) strategy for handling non-independent per-channel quantization, and a two-stage design space exploration that tunes pruning and quantization per layer; a specialized dataflow (ICPI/ICPA) further reduces time-to-first-token. Experiments report 159.9x (49.6x) and 34.8x (29.2x) higher energy efficiency (throughput) than the Nvidia A100 GPU and FlightLLM, respectively, along with reduced off-chip memory movement and negligible accuracy loss. The work demonstrates a practical path to accelerating large-scale LLMs through KV cache compression and CIM-based hardware design.
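The KV cache bottleneck can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the model configuration (32 layers, hidden size 4096, 7 billion FP16 parameters), the sequence length, and the batch size are assumptions chosen for the example and are not taken from the paper, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_value=2):
    """Bytes needed to cache Key and Value for every layer and token.

    The leading factor 2 covers the two cached tensors (Key and Value).
    Assumes standard multi-head attention, where K and V each store
    hidden_size values per token per layer.
    """
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_value


# Hypothetical 7B-parameter-class configuration (assumed, not from the paper).
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
NUM_PARAMS = 7_000_000_000
weight_bytes = NUM_PARAMS * 2  # FP16 weights, 2 bytes per parameter

cache = kv_cache_bytes(NUM_LAYERS, HIDDEN_SIZE, seq_len=4096, batch_size=64)
ratio = cache / weight_bytes

print(f"KV cache: {cache / 2**30:.1f} GiB")
print(f"Weights:  {weight_bytes / 2**30:.1f} GiB")
print(f"Ratio:    {ratio:.1f}x")
```

Under these assumed settings the cache is already about ten times the size of the weights, and because it grows linearly with both sequence length and batch size, longer contexts push the ratio into the tens.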

Abstract

Large language models (LLMs) have achieved great success in various domains. Existing systems cache the Key and Value within the attention block to avoid redundant computation. However, the size of the key-value cache (KV cache) is unpredictable and can even be tens of times larger than the weights in long-context scenarios. In this work, we propose Titanus, a software-hardware co-design that efficiently compresses the KV cache on-the-fly. We first propose the cascade pruning-quantization (CPQ) method to reduce KV cache movement. The hierarchical quantization extension strategy is introduced to tackle the non-independent per-channel quantization issue. To further reduce KV cache movement, we transfer only the non-zero KV cache between the accelerator and off-chip memory. Moreover, we customize a two-stage design space exploration framework for the CPQ method. A novel pipeline and parallelism dataflow is designed to reduce the first-token generation time. Experiments show that Titanus achieves 159.9x (49.6x) and 34.8x (29.2x) energy efficiency (throughput) compared to the Nvidia A100 GPU and FlightLLM, respectively. The code for Titanus is available at https://github.com/peilin-chen/Titanus-for-LLM-acceleration.
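The cascade described in the abstract (prune first, then quantize, then move only the non-zero entries) can be sketched for a single channel of cached values. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the pruning criterion or the quantizer, so magnitude thresholding and min-max uniform quantization are stand-ins, the bitmask encoding of the non-zero positions is an assumed format, and all function names are hypothetical. The hierarchical quantization extension is not modeled.

```python
def prune(values, threshold):
    """Zero out entries whose magnitude is below the threshold (assumed criterion)."""
    return [v if abs(v) >= threshold else 0.0 for v in values]


def quantize_nonzero(values, bits):
    """Min-max uniform quantization of the surviving (non-zero) entries.

    Returns integer codes plus the scale and zero point needed to
    dequantize. Pruned entries are left out entirely.
    """
    kept = [v for v in values if v != 0.0]
    if not kept:
        return [], 1.0, 0.0
    lo, hi = min(kept), max(kept)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in kept]
    return codes, scale, lo


def compress(values, threshold, bits):
    """Cascade: prune, then quantize only what survived pruning."""
    pruned = prune(values, threshold)
    mask = [v != 0.0 for v in pruned]  # marks which positions carry data
    codes, scale, zero = quantize_nonzero(pruned, bits)
    return mask, codes, scale, zero


def decompress(mask, codes, scale, zero):
    """Rebuild the channel: dequantize kept entries, restore zeros elsewhere."""
    it = iter(codes)
    return [next(it) * scale + zero if keep else 0.0 for keep in mask]


# One channel of cached values (made-up numbers for illustration).
channel = [0.02, -1.5, 0.9, -0.01, 2.5, 0.0, -0.7, 0.05]
mask, codes, scale, zero = compress(channel, threshold=0.1, bits=4)
restored = decompress(mask, codes, scale, zero)

print("mask: ", mask)
print("codes:", codes)
print("restored:", [round(v, 3) for v in restored])
```

Only `codes`, the mask, and the per-channel `scale` and `zero` would need to travel to off-chip memory in this sketch, which is where the reduction in data movement comes from: pruned entries cost one mask bit instead of a full value, and surviving entries cost `bits` bits each.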

Paper Structure

This paper contains 20 sections, 14 figures, and 2 tables.

Figures (14)

  • Figure 1: Overview of cascade pruning-quantization method.
  • Figure 2: Hierarchical quantization extension strategy.
  • Figure 3: Sensitivity of KV cache across different layers to pruning threshold (left) and quantization bit-width (right).
  • Figure 4: Titanus core-level overall architecture. CE and SZ denote the computing engine and scale-zero buffer, respectively.
  • Figure 5: Computing engine design for dot-product attention with zero detection functionality.
  • ...and 9 more figures