ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory
Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Yichi Yang, Zhijian Liu, Zhiting Hu, Lianhui Qin
TL;DR
ArcMemo addresses the memory bottleneck in deploying large language models for long-horizon reasoning by introducing concept-level external memory that stores reusable, modular abstractions rather than problem-specific patterns. It presents two memory formats—Open-Ended (OE) and Program Synthesis (PS)—to promote abstraction and modularity, and demonstrates that PS-based concept memory yields the strongest gains on ARC-AGI, especially as inference compute scales. The approach enables test-time continual learning via selective retrieval and iterative memory updates, with empirical evidence that memory selection and continual updates improve performance. The work provides a foundation for lifelong abstract reasoning in LLMs and releases resources to support future studies on abstraction-based memory.
Abstract
While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g., exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. We evaluate on ARC-AGI, a benchmark that stresses compositional generalization and abstract reasoning, making it a natural fit for concept memory. Our method yields a 7.5% relative gain over a strong no-memory baseline, with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, dynamically updating memory at test time outperforms fixed-memory settings, supporting the hypothesis that accumulating and abstracting patterns enables further solutions in a form of self-improvement. Code is available at https://github.com/matt-seb-ho/arc_memo.
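The write, retrieve, and prompt-integration loop described above can be illustrated with a minimal Python sketch. This is not the ArcMemo implementation: the names `Concept`, `ConceptMemory`, and `build_prompt` are hypothetical, and the keyword-overlap scoring is a stand-in assumption for whatever selection mechanism the released code actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One memory entry: a reusable abstraction stated in natural language."""
    name: str
    description: str
    cues: set[str] = field(default_factory=set)  # hypothetical relevance keywords


class ConceptMemory:
    """Toy concept store (illustrative only, not the authors' data structure)."""

    def __init__(self) -> None:
        self.concepts: dict[str, Concept] = {}

    def write(self, concept: Concept) -> None:
        """Add a concept, or merge cues into an existing entry of the same name."""
        existing = self.concepts.get(concept.name)
        if existing is None:
            self.concepts[concept.name] = concept
        else:
            existing.cues |= concept.cues

    def retrieve(self, query: str, k: int = 2) -> list[Concept]:
        """Return up to k concepts whose cues overlap the query's words.

        Keyword overlap is an assumed placeholder for a real selection step.
        """
        words = set(query.lower().split())
        scored = [(len(c.cues & words), c) for c in self.concepts.values()]
        scored = [(s, c) for s, c in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [c for _, c in scored[:k]]


def build_prompt(query: str, memory: ConceptMemory, k: int = 2) -> str:
    """Prepend only the selected concepts to the new query."""
    selected = memory.retrieve(query, k)
    lines = [f"- {c.name}: {c.description}" for c in selected]
    hints = "\n".join(lines) if lines else "(no relevant concepts)"
    return f"Relevant concepts:\n{hints}\n\nPuzzle:\n{query}"


memory = ConceptMemory()
memory.write(Concept("flood fill", "Fill an enclosed region with one color.",
                     {"enclosed", "fill"}))
memory.write(Concept("mirror", "Reflect the grid across an axis.",
                     {"mirror", "reflect"}))

prompt = build_prompt("fill every enclosed region with blue", memory)

# Test-time update: a later solution trace contributes a new cue to "mirror".
memory.write(Concept("mirror", "Reflect the grid across an axis.", {"flip"}))
```

Because only the selected entries are placed in the prompt, the store can keep growing across queries without the prompt growing with it, and no model weights are touched at any point in the loop.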
