ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory

Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Yichi Yang, Zhijian Liu, Zhiting Hu, Lianhui Qin

TL;DR

ArcMemo addresses the memory bottleneck in deploying large language models for long-horizon reasoning by introducing concept-level external memory that stores reusable, modular abstractions rather than problem-specific patterns. It presents two memory formats—Open-Ended (OE) and Program Synthesis (PS)—to promote abstraction and modularity, and demonstrates that PS-based concept memory yields the strongest gains on ARC-AGI, especially as inference compute scales. The approach enables test-time continual learning via selective retrieval and iterative memory updates, with empirical evidence that memory selection and continual updates improve performance. The work provides a foundation for lifelong abstract reasoning in LLMs and releases resources to support future studies on abstraction-based memory.

Abstract

While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g. exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. We evaluate on ARC-AGI, a benchmark that stresses compositional generalization and abstract reasoning, making it a natural fit for concept memory. Our method yields a 7.5% relative gain over a strong no-memory baseline with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, dynamically updating memory during test-time outperforms fixed settings, supporting the hypothesis that accumulating and abstracting patterns enables further solutions in a form of self-improvement. Code is available at https://github.com/matt-seb-ho/arc_memo.

Paper Structure

This paper contains 43 sections, 2 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Instance-Level vs. Abstract Concepts Example. Each ARC-AGI (Chollet, 2019) puzzle requires inferring the transformation rule for a set of input/output pixel grids. Here, Puzzle 1 instantiates $(A \land B) \Rightarrow C$ and Puzzle 2 instantiates $D \Rightarrow E$. The target puzzle is solved by recombining these ideas ($B \Rightarrow E, D \Rightarrow C$). Instance-level memory tends to store fully composed rules, coupling $A$ with $B,C$, and so on. Transferring to the target then demands both ignoring $A$ and disentangling/reordering $B,C$ with $D,E$. Abstract memory instead stores $A,B,C,D,E$ as separate, modular concepts, making them easier to recognize and reassemble in new contexts.
  • Figure 2: Method Diagram. Implementing a memory system requires defining (1) what is stored, (2) how memory is updated, and (3) how memory is used for new queries. The key novelty in this work is emphasizing abstraction and modularity, and the corresponding design changes. In particular, we highlight that parameterization (with higher-order functions allowed and encouraged) promotes abstraction, and typed interface definitions support modularity by showing which concepts can be combined. Since these memory entries are more abstract, they also require more inference to map against new, concrete situations, whether by aligning input against the memory format in a preprocessing query, or leveraging reasoning models to explore in a directed manner.
  • Figure 3: Open-Ended (OE) vs. Program Synthesis (PS) Concept Examples. Concepts from puzzle 9af7a82c, abstracted into each concept format. OE defers to the model, while PS imposes structure to encourage abstraction/modularity. Higher-order behavior is demonstrated by the "sort objects" concept, which takes a Callable parameter that selects the specific variation.
  • Figure 4: Token Efficiency Plot. Official score (oracle@2) of each setting on the public validation subset, plotted against the tokens the reasoning model used strictly for puzzle solving.
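
The parameterization and typed interfaces described in the captions for Figures 2 and 3 can be illustrated with a minimal Python sketch. This is not the paper's implementation: `GridObject`, `Concept`, `ConceptMemory`, `consumers_of`, and the type strings are hypothetical names chosen for illustration. Only the "sort objects" concept and its Callable parameter come from the Figure 3 caption; the "find objects" entry is an invented example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class GridObject:
    """Hypothetical stand-in for an object segmented from an ARC grid."""
    color: int
    size: int  # number of pixels


@dataclass
class Concept:
    """Illustrative memory entry: a named routine with a typed interface."""
    name: str
    description: str
    parameters: Dict[str, str]  # parameter name -> type, written as text
    output_type: str


class ConceptMemory:
    """Toy store that uses typed interfaces to suggest compositions."""

    def __init__(self) -> None:
        self._concepts: Dict[str, Concept] = {}

    def add(self, concept: Concept) -> None:
        # Entries are keyed by name, so re-adding a name overwrites it.
        self._concepts[concept.name] = concept

    def consumers_of(self, type_name: str) -> List[str]:
        """Names of concepts with a parameter of the given type."""
        return sorted(
            c.name
            for c in self._concepts.values()
            if type_name in c.parameters.values()
        )


def sort_objects(
    objects: List[GridObject],
    key: Callable[[GridObject], int],
    descending: bool = False,
) -> List[GridObject]:
    """Higher-order concept: the `key` callable selects the variation."""
    return sorted(objects, key=key, reverse=descending)


memory = ConceptMemory()
memory.add(Concept(
    name="find objects",  # assumed example entry, not from the paper
    description="Segment a grid into objects.",
    parameters={"grid": "Grid"},
    output_type="List[GridObject]",
))
memory.add(Concept(
    name="sort objects",
    description="Order objects by a caller-supplied criterion.",
    parameters={
        "objects": "List[GridObject]",
        "key": "Callable[[GridObject], int]",
    },
    output_type="List[GridObject]",
))

objs = [GridObject(1, 3), GridObject(2, 5), GridObject(3, 1)]
largest_first = sort_objects(objs, key=lambda o: o.size, descending=True)
print([o.color for o in largest_first])
print(memory.consumers_of("List[GridObject]"))
```

Passing a different `key` (for example `lambda o: o.color`) reuses the same stored concept for another variation, which is the kind of reuse parameterization is meant to encourage. The type strings let the toy store report that the output of "find objects" can feed "sort objects".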