LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling

Zhongchun Zhou, Chengtao Lai, Wei Zhang

TL;DR

LLM decoding is bottlenecked by memory bandwidth, and the KV Cache and Group-Query Attention make the problem worse. The authors propose LLaMCAT, a Cache Arbitration and Throttling (CAT) framework that uses MSHR-aware arbitration, load-balancing policies, and two-level dynamic throttling to reduce cache stalls and improve miss-handling throughput. They evaluate it with a memory-trace-driven hybrid simulation framework that combines analytical models with cycle-level simulators, which also serves as a flow for exploring LLC designs for LLM workloads. LLaMCAT achieves an average speedup of 1.26x when miss-handling throughput is the main bottleneck. When cache size is also limited, it achieves 1.58x over the unoptimized version and 1.26x over the best baseline (dyncta). The work targets LLC-based GPUs and AI accelerators running memory-bound decoding.

Abstract

Large Language Models (LLMs) have achieved unprecedented success across various applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work targets last-level cache (LLC) based architectures, including GPUs (e.g., NVIDIA GPUs) and AI accelerators. We introduce LLaMCAT, a novel approach to optimize the LLC for LLM inference. LLaMCAT combines Miss Status Holding Register (MSHR)- and load balance-aware cache arbitration with thread throttling to address stringent bandwidth demands and minimize cache stalls in KV Cache access. We also propose a hybrid simulation framework integrating analytical models with cycle-level simulators via memory traces, balancing architecture detail and efficiency. Experiments demonstrate that LLaMCAT achieves an average speedup of 1.26x when the system is mainly bottlenecked by miss handling throughput, while baselines mostly show negative improvements since they are not optimized for this scenario. When the cache size is also limited, our policy achieves a speedup of 1.58x over the unoptimized version, and a 1.26x improvement over the best baseline (dyncta). Overall, LLaMCAT is the first to target LLM decoding-specific MSHR contention, a gap in previous work. It presents a practical solution for accelerating LLM inference on future hardware platforms.

Paper Structure

This paper contains 36 sections, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: KV Cache Mechanism
  • Figure 2: Group-Query Attention
  • Figure 3: Example architectures applicable to this work. (a) GPGPU; (b) Datacenter-level AI SoC from Ascend. SM: Streaming Multiprocessor; MC: Memory Controller; CHN: Channel.
  • Figure 4: System assumption. Only 1 LLC slice and its corresponding arbiter are shown for simplicity. A slice comprises 1 or more cache sets. Items in red are our design, while others are the baseline. THRTL CTRL: throttling control unit; cnt: counter for the number of requests served for each core.
  • Figure 5: The process of selecting a request from the request queue and updating sent_reqs. spec_hit_result: speculated hit result. In this example, 0x00 and 0xc0 are inferred as cache hits (colored in grey), so they are not counted when estimating MSHR entries used.
  • ...and 4 more figures
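
The Figure 5 caption says that requests speculated to hit in the cache are not counted when estimating how many MSHR entries are in use. The Python sketch below illustrates only that counting rule and is not the paper's implementation. The names sent_reqs and spec_hit_result come from the caption. The function name, the one-entry-per-miss assumption, the treatment of unknown addresses as misses, and the addresses 0x40 and 0x80 are hypothetical choices made for illustration.

```python
from typing import Dict, List


def estimate_mshr_entries(sent_reqs: List[int],
                          spec_hit_result: Dict[int, bool]) -> int:
    """Estimate MSHR entries used by requests already sent to the cache.

    Assumption: each request speculated to miss occupies one MSHR entry.
    Requests speculated to hit are skipped, as in the Figure 5 caption.
    Addresses with no speculation result are conservatively treated as misses.
    """
    return sum(1 for addr in sent_reqs
               if not spec_hit_result.get(addr, False))


# 0x00 and 0xc0 are speculated hits (as in the caption's example);
# 0x40 and 0x80 are hypothetical speculated misses added for illustration.
sent_reqs = [0x00, 0x40, 0x80, 0xC0]
spec_hit_result = {0x00: True, 0x40: False, 0x80: False, 0xC0: True}

estimated = estimate_mshr_entries(sent_reqs, spec_hit_result)
print(estimated)
```

An arbiter could compare such an estimate against the MSHR capacity before selecting the next request. How the paper actually uses the estimate is not stated in this excerpt.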