UniCAIM: A Unified CAM/CIM Architecture with Static-Dynamic KV Cache Pruning for Efficient Long-Context LLM Inference

Weikai Xu, Wenxuan Zeng, Qianqian Huang, Meng Li, Ru Huang

TL;DR

UniCAIM tackles the memory and compute bottlenecks of long-context LLM inference by integrating a unified CAM/CIM architecture with a hybrid static-dynamic KV cache pruning strategy. It introduces FeFET-based UniCAIM cells that support three operation modes: CAM for fast approximate top-k pruning, charge-domain CIM for static pruning via accumulative similarity, and current-domain CIM for exact attention on a selected KV subset. The approach reduces the area-energy-delay product (AEDP) by 8.2× to 831× over state-of-the-art CIM-based LLM accelerators at the circuit level, with accuracy comparable to dense attention on long-context tasks, demonstrating strong potential for efficient long-context LLM inference.

Abstract

Transformer-based large language models (LLMs) have achieved impressive performance in various natural language processing (NLP) applications. However, the high memory and computation cost induced by the KV cache limits the inference efficiency, especially for long input sequences. Compute-in-memory (CIM)-based accelerators have been proposed for LLM acceleration with KV cache pruning. However, as existing accelerators only support static pruning with a fixed pattern or dynamic pruning with primitive implementations, they suffer from either high accuracy degradation or low efficiency. In this paper, we propose a ferroelectric FET (FeFET)-based unified content addressable memory (CAM) and CIM architecture, dubbed UniCAIM. UniCAIM features simultaneous support for static and dynamic pruning with three computation modes: 1) in the CAM mode, UniCAIM enables approximate similarity measurement in O(1) time for dynamic KV cache pruning with high energy efficiency; 2) in the charge-domain CIM mode, static pruning can be supported based on an accumulative similarity score, which is much more flexible than fixed patterns; 3) in the current-domain CIM mode, exact attention computation can be conducted on a selected subset of the KV cache. We further propose a novel CAM/CIM cell design that leverages the multi-level characteristics of FeFETs for signed multibit storage of the KV cache and in-place attention computation. With extensive experimental results, we demonstrate that UniCAIM can reduce the area-energy-delay product (AEDP) by 8.2-831× over state-of-the-art CIM-based LLM accelerators at the circuit level, along with accuracy comparable to dense attention at the application level, showing its great potential for efficient long-context LLM inference.
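The three modes in the abstract can be read as one software-level decode step: an approximate similarity pass picks the top-k tokens, exact attention runs only on that subset, and a per-token running score is updated for later static pruning. The following is a minimal Python sketch of that flow, not the paper's implementation. The sign-agreement similarity standing in for the CAM mode, the use of accumulated attention weights as the "accumulative similarity score", and all function names are assumptions made for illustration.

```python
import math

def approx_scores(query, keys):
    # Hypothetical stand-in for CAM-mode approximate similarity:
    # count the dimensions where query and key agree in sign.
    return [sum(1 for q, k in zip(query, key) if (q >= 0) == (k >= 0))
            for key in keys]

def top_k_indices(scores, k):
    # Indices of the k largest scores (ties go to the lower index),
    # returned in ascending index order.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])

def sparse_attention(query, keys, values, selected):
    # Exact scaled dot-product softmax attention over the selected tokens only.
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[i]))
              for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, selected))
           for d in range(dim)]
    return out, weights

def decode_step(query, keys, values, acc_scores, k):
    # 1) dynamic pruning: approximate top-k selection
    selected = top_k_indices(approx_scores(query, keys), k)
    # 2) exact attention on the selected subset
    out, weights = sparse_attention(query, keys, values, selected)
    # 3) update running scores (assumed here to be accumulated attention weights)
    for w, i in zip(weights, selected):
        acc_scores[i] += w
    return out, selected
```

In this sketch, tokens that are never selected keep a low running score, which is what a score-based static pruning rule would later use to decide evictions.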

Paper Structure

This paper contains 24 sections, 1 equation, 15 figures.

Figures (15)

  • Figure 1: (a) Various NLP tasks with increasing sequence length. (b) The impact of sequence length on KV cache size and attention latency in Llama-2-7B, which is a typical LLM, indicating the memory and computation challenges faced by long-context LLMs.
  • Figure 2: (a) Typical device structure of ferroelectric FET (FeFET). (b) FE polarization-voltage loops with multilevel FE polarizations. (c) Gradually modulated ID-VG curves of FeFET for the multilevel storage capability.
  • Figure I: Qualitative comparison of proposed FeFET-based UniCAIM with the state-of-the-art CIM-based LLM accelerators.
  • Figure III: Framework of the proposed hybrid static-dynamic KV cache pruning algorithm. (a) During the prefill stage, static pruning evicts unimportant tokens for the subsequent generation. (b) During the decoding stage, dynamic pruning preserves a subset of tokens for sparse attention computation, while static pruning evicts one token at each step when the generated length exceeds the reserved size for a fixed KV cache size.
  • Figure IV: (a) The proposed UniCAIM architecture for static-dynamic KV cache pruning and sparse attention computing. (b) The hardware design of UniCAIM based on FeFET, including the FeFET-based UniCAIM array and carefully designed peripheral circuits for CAM, charge-domain CIM, and current-domain CIM.
  • ...and 10 more figures
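
The static pruning behaviour described in the Figure III caption (evict unimportant tokens after prefill, then evict one token per decoding step once the cache is full) can be sketched as a fixed-budget cache. This is a hedged illustration under stated assumptions: the cache layout, the function names, and the rule "evict the token with the lowest running score" are hypothetical choices, and the scores are assumed to be maintained elsewhere.

```python
def _evict_lowest(cache):
    # Remove the token with the smallest running score (ties: lowest index).
    scores = cache["scores"]
    victim = min(range(len(scores)), key=lambda i: (scores[i], i))
    for name in ("keys", "values", "scores"):
        del cache[name][victim]
    return victim

def prune_after_prefill(cache, budget):
    # Prefill stage: evict until at most `budget` tokens remain.
    while len(cache["keys"]) > budget:
        _evict_lowest(cache)

def append_with_eviction(cache, key, value, budget):
    # Decoding stage: if the cache is full, evict one token first,
    # so the cache size never exceeds `budget`.
    if len(cache["keys"]) >= budget:
        _evict_lowest(cache)
    cache["keys"].append(key)
    cache["values"].append(value)
    cache["scores"].append(0.0)  # a new token starts with no accumulated score
```

Evicting before appending keeps the newly generated token from being removed in the same step it arrives, since its running score starts at zero.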