CacheMind: From Miss Rates to Why -- Natural-Language, Trace-Grounded Reasoning for Cache Replacement

Kaushal Mhapsekar, Azam Ghanbari, Bita Aslrousta, Samira Mirbagher-Ajorpaz

TL;DR

CacheMind reframes cache replacement analysis as interactive, natural-language reasoning over trace data. By combining LLMs with two retrieval engines (SIEVE for fast grounding and RANGER for dynamic, code-generated retrieval), it produces verifiable, trace-grounded explanations of per-event cache behavior. The authors introduce CacheMindBench, a 100-item benchmark with trace-grounded and architectural reasoning tiers, and demonstrate that retrieval precision is critical, while reasoning-augmented LLMs (RANGER) substantially improve results on open-ended questions and code-generation tasks. The work yields actionable insights for bypass strategies, software interventions, and prefetching, illustrating a practical path toward hardware-software co-design informed by AI-assisted analysis. Open-source artifacts and benchmarks are provided to foster reproducibility and further research in AI-assisted microarchitectural reasoning.
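
To make the idea of a "task-specific slice" concrete, here is a minimal sketch of filtering a trace down to the events relevant to one question, such as "which PC misses most under a given policy?". It is not CacheMind's SIEVE or RANGER implementation: the record fields (`policy`, `pc`, `address`, `hit`), the policy labels, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy trace. Field names, policy labels, and values are
# illustrative assumptions, not the schema used by CacheMind or ChampSim.
TRACE = [
    {"policy": "policy_a", "pc": 0x401A10, "address": 0x7F00, "hit": False},
    {"policy": "policy_a", "pc": 0x401A10, "address": 0x7F40, "hit": False},
    {"policy": "policy_a", "pc": 0x401B20, "address": 0x8000, "hit": True},
    {"policy": "policy_a", "pc": 0x401A10, "address": 0x7F00, "hit": True},
    {"policy": "policy_b", "pc": 0x401A10, "address": 0x7F00, "hit": True},
]


def slice_trace(trace, policy, pc=None):
    """Keep only events for one policy and, optionally, one PC."""
    return [
        event for event in trace
        if event["policy"] == policy and (pc is None or event["pc"] == pc)
    ]


def misses_per_pc(trace, policy):
    """Count misses per PC under the given policy."""
    return Counter(
        event["pc"] for event in slice_trace(trace, policy) if not event["hit"]
    )


# Find the PC with the most misses under policy_a, then collect the
# individual miss events that would serve as evidence for an answer.
misses = misses_per_pc(TRACE, "policy_a")
worst_pc, worst_count = misses.most_common(1)[0]
evidence = [e for e in slice_trace(TRACE, "policy_a", pc=worst_pc) if not e["hit"]]
print(f"PC {worst_pc:#x} missed {worst_count} times under policy_a")
```

The point of the sketch is that the answer is backed by specific trace events (`evidence`) rather than by an aggregate miss rate alone.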

Abstract

Cache replacement remains a challenging problem in CPU microarchitecture. It is often addressed with hand-crafted heuristics, which limits cache performance. Cache data analysis requires parsing millions of trace entries with manual filtering, making the process slow and non-interactive. To address this, we introduce CacheMind, a conversational tool that uses Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to enable semantic reasoning over cache traces. For the first time, architects can ask natural-language questions such as "Why is the memory access associated with PC X causing more evictions?" and receive trace-grounded, human-readable answers linked to program semantics. To evaluate CacheMind, we present CacheMindBench, the first verified benchmark suite for LLM-based reasoning about the cache replacement problem. Using the SIEVE retriever, CacheMind achieves 66.67% on 75 unseen trace-grounded questions and 84.80% on 25 unseen policy-specific reasoning tasks; with RANGER, it achieves 89.33% and 64.80% on the same evaluations. Additionally, with RANGER, CacheMind achieves 100% accuracy on 4 out of 6 categories in the trace-grounded tier of CacheMindBench. Compared to LlamaIndex (10% retrieval success), SIEVE achieves 60% and RANGER achieves 90%, demonstrating that existing RAG systems are insufficient for precise, trace-grounded microarchitectural reasoning. We present four concrete, actionable insights derived using CacheMind, including a bypassing use case that improves cache hit rate by 7.66% and yields a 2.04% speedup, a software-fix use case that yields a 76% speedup, and a Mockingjay replacement-policy use case that yields a 0.7% speedup. These show the utility of CacheMind on non-trivial queries that require a natural-language interface.
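
As a reading aid, the per-tier accuracies above can be combined into a single figure over all 100 benchmark items. The sketch below does that arithmetic. The overall figure is derived here, not reported in the abstract, and it assumes every item carries equal weight; the authors may aggregate differently.

```python
# Tier sizes and reported accuracies (percent), taken from the abstract.
TIERS = {"trace_grounded": 75, "policy_reasoning": 25}
REPORTED = {
    "SIEVE": {"trace_grounded": 66.67, "policy_reasoning": 84.80},
    "RANGER": {"trace_grounded": 89.33, "policy_reasoning": 64.80},
}


def weighted_overall(accuracy_by_tier):
    """Item-weighted average accuracy across tiers (an assumed aggregation)."""
    total_items = sum(TIERS.values())
    weighted_sum = sum(accuracy_by_tier[tier] * n for tier, n in TIERS.items())
    return weighted_sum / total_items


def implied_correct(accuracy_percent, n_items):
    """Number of correct items implied by a percentage, rounded to nearest."""
    return round(accuracy_percent * n_items / 100)


for retriever, acc in REPORTED.items():
    n_correct = implied_correct(acc["trace_grounded"], TIERS["trace_grounded"])
    overall = weighted_overall(acc)
    print(f"{retriever}: {n_correct}/75 trace-grounded, overall {overall:.1f}%")
```

Under the equal-weight assumption, the trace-grounded scores correspond to 50 and 67 of 75 questions, and the item-weighted overall figures come to roughly 71.2% for SIEVE and 83.2% for RANGER.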

Paper Structure (43 sections, 13 figures, 2 tables)

Figures (13)

  • Figure 1: The method filters raw traces to a task-specific slice and returns the most informative evidence for the user’s query. ChampSim on its own could tell you that a miss occurred; CacheMind shows which PC missed on which data, under which policy, and why, for every event. It acts as a microarchitectural microscope that turns raw traces into per-PC, cross-policy answers.
  • Figure 2: Example trace excerpt retrieved by CacheMind
  • Figure 3: Formatted system prompt summarizing the data container, dataframe schema, metadata string, task flow, and strict output rules for code generation with CacheMind-Ranger. The ability to compare designs, find the root cause of performance differences among them, and judge replacement policies accurately is the essential first step toward AI-assisted automation of cache policy debugging and design.
  • Figure 4: Accuracy of CacheMind with different LLM backends across CacheMindBench categories.
  • Figure 5: Accuracy across retrieval-context quality (Low/Medium/High) for each backend paired with CacheMind. It shows that retrieval quality is a precondition for high-level reasoning about cache replacement policies.
  • ...and 8 more figures