LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling
Zhongchun Zhou, Chengtao Lai, Wei Zhang
TL;DR
LLM decoding is bottlenecked by memory bandwidth, a problem exacerbated by KV Cache accesses and Grouped-Query Attention. The authors propose LLaMCAT, a Cache Arbitration and Throttling (CAT) framework that uses MSHR-aware arbitration, load-balancing policies, and two-level dynamic throttling to reduce cache stalls and improve miss-handling throughput. They evaluate it with a memory-trace-driven simulation framework that combines analytical models with cycle-level detail. LLaMCAT achieves an average 1.26x speedup when miss-handling throughput is the main bottleneck; when cache size is also limited, it reaches 1.58x over the unoptimized design and 1.26x over the best baseline (dyncta). The hybrid simulation flow can also be used to evaluate other LLC designs for LLM workloads on LLC-based GPUs and AI accelerators.
Abstract
Large Language Models (LLMs) have achieved unprecedented success across various applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work targets last-level cache (LLC) based architectures, including GPUs (e.g., NVIDIA GPUs) and AI accelerators. We introduce LLaMCAT, a novel approach to optimizing the LLC for LLM inference. LLaMCAT combines Miss Status Holding Register (MSHR)-aware and load-balance-aware cache arbitration with thread throttling to address stringent bandwidth demands and minimize cache stalls in KV Cache access. We also propose a hybrid simulation framework that integrates analytical models with cycle-level simulators via memory traces, balancing architectural detail and simulation efficiency. Experiments demonstrate that LLaMCAT achieves an average speedup of 1.26x when the system is mainly bottlenecked by miss-handling throughput, whereas most baselines slow the system down because they are not optimized for this scenario. When the cache size is also limited, our policy achieves a speedup of 1.58x over the unoptimized version and a 1.26x improvement over the best baseline (dyncta). Overall, LLaMCAT is the first work to target LLM decoding-specific MSHR contention, a gap in previous work. It presents a practical solution for accelerating LLM inference on future hardware platforms.
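To make the idea of MSHR-aware arbitration concrete, the following is a minimal Python sketch of one possible arbiter. It is not the policy described in the paper: the function name `pick_request`, the 64-byte line size, the preference order, and the use of queue length as a load-balance signal are all assumptions made for illustration. The sketch assumes every request misses in the cache, so each one either merges into an existing MSHR entry or needs a new one.

```python
from collections import deque

LINE_BYTES = 64  # assumed cache-line size (illustrative, not from the paper)


def line_of(addr):
    """Map a byte address to its cache-line index."""
    return addr // LINE_BYTES


def pick_request(queues, mshr_lines, mshr_capacity):
    """Return the index of the queue whose head request is served next.

    Hypothetical preference order:
      1. A head request whose line already has an MSHR entry, since it
         can merge without allocating a new entry.
      2. Any head request, if a free MSHR entry exists.
      3. Nothing (None): serving a new miss now would stall the cache.
    Ties are broken toward the longest queue as a crude load-balance signal.
    """
    heads = [(i, q[0]) for i, q in enumerate(queues) if q]
    if not heads:
        return None
    mergeable = [i for i, addr in heads if line_of(addr) in mshr_lines]
    if mergeable:
        return max(mergeable, key=lambda i: len(queues[i]))
    if len(mshr_lines) < mshr_capacity:
        return max((i for i, _ in heads), key=lambda i: len(queues[i]))
    return None


# Three request queues (for example, one per core); values are byte addresses.
queues = [deque([0, 64]), deque([128]), deque([4096, 4160, 4224])]

# MSHR full, but queue 1's head (line 2) is already outstanding, so it merges.
merge_choice = pick_request(queues, mshr_lines={2}, mshr_capacity=1)

# MSHR empty: the longest queue (index 2) is served first.
free_choice = pick_request(queues, mshr_lines=set(), mshr_capacity=1)

# MSHR full with an unrelated line: no request can proceed without stalling.
stall_choice = pick_request(queues, mshr_lines={99}, mshr_capacity=1)
```

An arbiter that ignores MSHR occupancy would forward a new miss in the third case and block the cache pipeline. Checking occupancy first lets mergeable requests through while the MSHR is full, which is the kind of contention the abstract refers to.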
