Mem-α: Learning Memory Construction via Reinforcement Learning

Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, Xiaojian Wu

TL;DR

Mem-alpha is a reinforcement learning framework that teaches LLM agents how to construct memory. It addresses limited context windows by training an agent to manage a three-component memory architecture (core, semantic, episodic) through interaction, with rewards derived from downstream question answering (QA). The system uses a diverse training dataset and a retrieval-augmented generation pipeline to evaluate how comprehensive the constructed memory is, and optimizes the policy with Group Relative Policy Optimization (GRPO). Empirically, Mem-alpha outperforms strong memory baselines on retrieval, long-range understanding, and test-time learning (TTL) tasks, and generalizes to sequences far longer than those seen in training. The work suggests that learned memory construction can yield robust, scalable memory management suitable for long-context reasoning and real-world deployment.
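The group-relative idea behind Group Relative Policy Optimization can be illustrated with a minimal sketch: several rollouts are sampled for the same training instance, and each rollout's reward is normalized against the mean and standard deviation of its own group. The function name, the sample rewards, the `eps` term, and the use of the population standard deviation are assumptions made for illustration; this section does not give the paper's exact objective or reward terms.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against the other rollouts in its group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical QA accuracies for four memory-construction rollouts
# sampled on the same training instance.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
```

Rollouts that score above the group mean receive a positive advantage and those below receive a negative one, so no separate learned value baseline is needed for this step.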

Abstract

Large language model (LLM) agents are constrained by limited context windows, necessitating external memory systems for long-term information understanding. Current memory-augmented agents typically depend on pre-defined instructions and tools for memory updates. However, language models may lack the ability to determine which information to store, how to structure it, and when to update it, especially as memory systems become more complex. This results in suboptimal memory construction and information loss. To this end, we propose Mem-alpha, a reinforcement learning framework that trains agents to effectively manage complex memory systems through interaction and feedback. We also construct a specialized training dataset spanning diverse multi-turn interaction patterns paired with comprehensive evaluation questions designed to teach effective memory management. During training, agents process sequential information chunks, learn to extract and store relevant content, then update the memory system. The reward signal derives from downstream question-answering accuracy over the full interaction history, directly optimizing for memory construction. To illustrate the effectiveness of our training framework, we design a memory architecture comprising core, episodic, and semantic components, equipped with multiple tools for memory operations. Empirical evaluation demonstrates that Mem-alpha achieves significant improvements over existing memory-augmented agent baselines. Despite being trained exclusively on instances with a maximum length of 30k tokens, our agents exhibit remarkable generalization to sequences exceeding 400k tokens, over 13x the training length, highlighting the robustness of Mem-alpha.
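The training episode described in the abstract (process chunks sequentially, update memory, then score the final memory by downstream QA accuracy) might be sketched as follows. Everything named here is hypothetical: `run_episode`, `qa_accuracy_reward`, the toy update and answer functions, and the exact-match scoring are stand-ins for the learned policy, the retrieval-augmented answering pipeline, and the paper's actual metric.

```python
from typing import Callable, List, Tuple

Memory = List[str]


def qa_accuracy_reward(
    memory: Memory,
    qa_pairs: List[Tuple[str, str]],
    answer_fn: Callable[[Memory, str], str],
) -> float:
    """Fraction of questions answered correctly from memory alone.

    Assumption: case-insensitive exact match stands in for the real metric.
    """
    if not qa_pairs:
        return 0.0
    correct = sum(
        1
        for question, gold in qa_pairs
        if answer_fn(memory, question).strip().lower() == gold.strip().lower()
    )
    return correct / len(qa_pairs)


def run_episode(
    chunks: List[str],
    update_fn: Callable[[Memory, str], Memory],
    qa_pairs: List[Tuple[str, str]],
    answer_fn: Callable[[Memory, str], str],
) -> float:
    """Feed chunks one at a time, then score the final memory with QA."""
    memory: Memory = []
    for chunk in chunks:
        memory = update_fn(memory, chunk)  # the policy decides what to store
    return qa_accuracy_reward(memory, qa_pairs, answer_fn)


# Toy stand-ins for the policy and the answering pipeline.
def store_everything(memory: Memory, chunk: str) -> Memory:
    return memory + [chunk]


def drop_bob(memory: Memory, chunk: str) -> Memory:
    return memory if "Bob" in chunk else memory + [chunk]


def lookup_answer(memory: Memory, question: str) -> str:
    for entry in memory:
        if question in entry:
            return entry.split()[-1]
    return ""


chunks = ["Alice lives in Paris", "Bob lives in Oslo"]
qa = [("Alice", "Paris"), ("Bob", "Oslo")]
reward_full = run_episode(chunks, store_everything, qa, lookup_answer)  # 1.0
reward_lossy = run_episode(chunks, drop_bob, qa, lookup_answer)  # 0.5
```

Because the questions are answered from the memory rather than from the raw chunks, a policy that discards needed information earns a lower reward, which is the signal the abstract describes.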

Paper Structure

This paper contains 49 sections, 11 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Reinforcement learning teaches agents to select appropriate memory tools and types. Before training (left), agents struggle with tool selection when given new information. After RL training (right), agents learn effective memory management policies.
  • Figure 2: Training Framework of Mem-α.
  • Figure 3: Memory Architecture: Core Memory stores a single paragraph (max 512 tokens), while Semantic Memory and Episodic Memory maintain expandable lists of sentences for facts and timestamped events, respectively.
  • Figure 4: The prompt used to extract keywords in the summaries of BookSum and InfBench-Sum.
  • Figure 5: The examples in the training dataset. For SQuAD, HotpotQA, PerLTQA, LME-Train, we show the examples directly; for Test-Time-Learning datasets (Pubmed-RCT, NLU, and Trec-C) and BookSum, we demonstrate the format for clarity.
  • ...and 5 more figures
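
The memory layout described in the Figure 3 caption can be sketched as a small data structure: one bounded core paragraph plus two growable lists. The class name, method names, and the whitespace-based token count are assumptions for illustration; only the three components and the 512-token core limit come from the caption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

CORE_MAX_TOKENS = 512  # limit stated in the Figure 3 caption


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


@dataclass
class AgentMemory:
    core: str = ""  # a single paragraph
    semantic: List[str] = field(default_factory=list)  # fact sentences
    episodic: List[Tuple[str, str]] = field(default_factory=list)  # (timestamp, event)

    def update_core(self, paragraph: str) -> None:
        """Replace the core paragraph, enforcing the token limit."""
        if count_tokens(paragraph) > CORE_MAX_TOKENS:
            raise ValueError("core memory exceeds 512 tokens")
        self.core = paragraph

    def add_fact(self, sentence: str) -> None:
        self.semantic.append(sentence)

    def add_event(self, timestamp: str, sentence: str) -> None:
        self.episodic.append((timestamp, sentence))


# Hypothetical usage with made-up content.
memory = AgentMemory()
memory.update_core("The user prefers concise answers.")
memory.add_fact("The user's dog is named Rex.")
memory.add_event("2024-01-05", "The user adopted a dog.")
```

Keeping the core as one bounded string while the other two components grow as lists mirrors the caption's distinction between a fixed-size summary and expandable stores.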