Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, James Zou

TL;DR

Dynamic Cheatsheet (DC) is a test-time, memory-augmented framework that gives black-box LLMs a persistent external memory. The memory is selectively curated and, optionally, retrieved to guide problem solving, so strategies and code can be reused without gradient updates. Across AIME, Game of 24, GPQA-Diamond, Math Equation Balancer, and MMLU-Pro, DC yields substantial accuracy gains for large models, while smaller models benefit less. The approach reduces repeated errors, improves tool use, and offers a practical path toward experience-driven reasoning in deployed LLMs.

Abstract

Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering the same solutions or re-committing the same mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback. Leveraging DC, Claude 3.5 Sonnet's accuracy more than doubled on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks. Claude achieved a 9% improvement on GPQA-Diamond and an 8% boost on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
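
The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the signatures, and the order of operations (answer first, then update the memory) are assumptions made for illustration, and the two toy callables stand in for prompts sent to the same black-box LM. No ground-truth labels are used anywhere in the loop.

```python
from typing import Callable, List, Tuple

# Hypothetical role signatures (assumed for this sketch):
#   generator: (query, memory) -> answer
#   curator:   (memory, query, answer) -> updated memory
Generator = Callable[[str, str], str]
Curator = Callable[[str, str, str], str]


def run_with_cheatsheet(
    queries: List[str],
    generate: Generator,
    curate: Curator,
    memory: str = "",
) -> Tuple[List[str], str]:
    """Process queries in sequence while carrying a text memory forward."""
    answers = []
    for query in queries:
        # The generator sees the current memory alongside the query.
        answer = generate(query, memory)
        # The curator decides what to keep; the model weights never change.
        memory = curate(memory, query, answer)
        answers.append(answer)
    return answers, memory


# Toy stand-ins for LM calls, so the sketch runs without any API access.
def toy_generate(query: str, memory: str) -> str:
    return "used memory" if "hint" in memory else "from scratch"


def toy_curate(memory: str, query: str, answer: str) -> str:
    # Store one concise, reusable note instead of the full transcript.
    if "hint" in memory:
        return memory
    return memory + "hint: reuse the solver\n"


answers, final_memory = run_with_cheatsheet(
    ["q1", "q2", "q3"], toy_generate, toy_curate
)
print(answers)
print(final_memory)
```

In this toy run the first query is answered without help, and every later query benefits from the note stored after the first one, which mirrors the reuse pattern reported for Game of 24.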

Paper Structure

This paper contains 32 sections, 3 equations, 14 figures, and 4 tables.

Figures (14)

  • Figure 2: Overall task performance of Claude 3.5 Sonnet under the baseline prompting approach with minimal instructions (BL) and Dynamic Cheatsheet with Retrieval & Synthesis (DC-RS).
  • Figure 3: Algorithmic illustration of the Dynamic Cheatsheet (DC)-based approaches and other baseline methods. Here, $\texttt{Gen}$ represents the solution generator model, $\texttt{Cur}$ the memory curator, and $\texttt{Retr}$ the retriever. While we use the same black-box LLMs for both generation and curation, we differentiate their roles via task-agnostic instructions (prompts). The retrieval mechanism ranks historical inputs based on cosine similarity with the current query, selecting the most relevant past examples along with their generated solutions.
  • Figure 4: Illustration of Dynamic Cheatsheet (DC-Cu variant).
  • Figure 5: Excerpt from GPT-4o’s external memory after processing 100 examples from Game of 24 under DC-RS. Early in the test sequence, the model discovered a Python-based brute-force solution, stored it, and then retrieved it for later puzzles. This shift to structured code reuse raised accuracy from 10% to 99%, eliminating arithmetic errors and redundant problem-solving effort.
  • Figure 6: Example of Claude 3.5 Sonnet’s curated memory after processing 20 AIME 2024 questions under DC-Cu. The memory captures key solution strategies, enables the model to generalize across similar computational problems, and boosts its accuracy.
  • ...and 9 more figures
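
The retrieval step mentioned in the Figure 3 caption (ranking historical inputs by cosine similarity with the current query) can be sketched as follows. This is an illustrative sketch only: the embedding vectors are assumed to come from some unspecified embedding model, and the names `cosine_similarity`, `retrieve_top_k`, and the parameter `k` are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

# One history entry: (embedding of past input, past input, generated solution).
HistoryItem = Tuple[Sequence[float], str, str]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_vec: Sequence[float],
    history: List[HistoryItem],
    k: int = 3,
) -> List[Tuple[str, str]]:
    """Return the k past (input, solution) pairs most similar to the query."""
    ranked = sorted(
        history,
        key=lambda item: cosine_similarity(query_vec, item[0]),
        reverse=True,
    )
    return [(past_input, solution) for _, past_input, solution in ranked[:k]]


# Toy 2-D "embeddings" (assumed values, for illustration only).
history = [
    ([1.0, 0.0], "puzzle A", "solution A"),
    ([0.0, 1.0], "puzzle B", "solution B"),
    ([1.0, 1.0], "puzzle C", "solution C"),
]
top = retrieve_top_k([1.0, 0.1], history, k=2)
print(top)
```

The retrieved pairs would then be passed to the generator together with the current query; how they are combined with the curated memory is not specified in the text above, so it is left out of the sketch.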