CacheMind: From Miss Rates to Why -- Natural-Language, Trace-Grounded Reasoning for Cache Replacement
Kaushal Mhapsekar, Azam Ghanbari, Bita Aslrousta, Samira Mirbagher-Ajorpaz
TL;DR
CacheMind reframes cache replacement analysis as interactive, natural-language reasoning over trace data. It pairs LLMs with two retrieval engines, SIEVE for fast grounding and RANGER for dynamic, code-generated retrieval, to produce verifiable, trace-grounded explanations of per-event cache behavior. The authors introduce CacheMindBench, a 100-item benchmark with trace-grounded and architectural reasoning tiers, and show that retrieval precision is critical, while reasoning-augmented LLMs (with RANGER) substantially improve performance on open-ended questions and code-generation tasks. The work yields actionable insights for bypass strategies, software interventions, and prefetching, illustrating a practical path toward hardware-software co-design informed by AI-assisted analysis. Open-source artifacts and benchmarks are provided to support reproducibility and further research on AI-assisted microarchitectural reasoning.
Abstract
Cache replacement remains a challenging problem in CPU microarchitecture. It is often addressed with hand-crafted heuristics, which limits cache performance. Analyzing cache data requires parsing millions of trace entries with manual filtering, making the process slow and non-interactive. To address this, we introduce CacheMind, a conversational tool that uses Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to enable semantic reasoning over cache traces. Architects can now ask natural-language questions such as "Why is the memory access associated with PC X causing more evictions?" and, for the first time, receive trace-grounded, human-readable answers linked to program semantics. To evaluate CacheMind, we present CacheMindBench, the first verified benchmark suite for LLM-based reasoning about the cache replacement problem. Using the SIEVE retriever, CacheMind achieves 66.67% on 75 unseen trace-grounded questions and 84.80% on 25 unseen policy-specific reasoning tasks; with RANGER, it achieves 89.33% and 64.80% on the same evaluations. Additionally, with RANGER, CacheMind achieves 100% accuracy on 4 out of 6 categories in the trace-grounded tier of CacheMindBench. Compared to LlamaIndex (10% retrieval success), SIEVE achieves 60% and RANGER achieves 90%, demonstrating that existing RAG frameworks are insufficient for precise, trace-grounded microarchitectural reasoning. We present four concrete, actionable insights derived using CacheMind. Among them, a bypassing use case improves cache hit rate by 7.66% and yields a 2.04% speedup, a software-fix use case yields a 76% speedup, and a Mockingjay replacement policy use case yields a 0.7% speedup. These results show the utility of CacheMind on non-trivial queries that require a natural-language interface.
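To make the manual filtering step concrete, the following is a minimal sketch of the per-PC aggregation that a question like "Why is the memory access associated with PC X causing more evictions?" implicitly requires. It is not CacheMind's implementation: the `TraceEvent` record, its fields, and the helper names `evictions_by_pc` and `miss_rate_for_pc` are hypothetical, and real simulator traces will use a different schema.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class TraceEvent(NamedTuple):
    """Hypothetical per-access trace record (assumed schema, not from the paper)."""
    pc: int                  # program counter of the instruction issuing the access
    address: int             # memory address that was accessed
    hit: bool                # True if the access hit in the cache
    evicted: Optional[int]   # address of the line evicted by this access, if any


def evictions_by_pc(trace: Iterable[TraceEvent]) -> Counter:
    """Count how many evictions each PC triggered."""
    counts: Counter = Counter()
    for event in trace:
        if event.evicted is not None:
            counts[event.pc] += 1
    return counts


def miss_rate_for_pc(trace: Iterable[TraceEvent], pc: int) -> Optional[float]:
    """Return the miss rate of accesses issued by `pc`, or None if it never appears."""
    accesses = [event for event in trace if event.pc == pc]
    if not accesses:
        return None
    misses = sum(1 for event in accesses if not event.hit)
    return misses / len(accesses)


# Tiny illustrative trace; the values are made up for the example.
trace = [
    TraceEvent(pc=0x400A10, address=0x1000, hit=False, evicted=None),
    TraceEvent(pc=0x400A10, address=0x2000, hit=False, evicted=0x1000),
    TraceEvent(pc=0x400B20, address=0x1000, hit=False, evicted=0x2000),
    TraceEvent(pc=0x400A10, address=0x3000, hit=False, evicted=0x1000),
    TraceEvent(pc=0x400B20, address=0x3000, hit=True, evicted=None),
]

eviction_counts = evictions_by_pc(trace)
worst_pc, worst_count = eviction_counts.most_common(1)[0]
print(f"PC {worst_pc:#x} triggered {worst_count} evictions")
```

Written by hand, each new question needs another such filter over millions of entries, which is the slow, non-interactive workflow the abstract describes.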
