AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang; Dongyu Ru; Lin Qiu; Yiyang Li; Xuezhi Cao; Yangqiu Song; Xunliang Cai

TL;DR

AMemGym is an interactive environment for on-policy evaluation and optimization of memory-driven personalization, offering a scalable, diagnostically rich testbed for advancing memory capabilities in conversational agents.

Abstract

Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in both training and evaluating memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining structured state consistency. Comprehensive metrics based on structured data guide both assessment and optimization of assistants. Extensive experiments reveal performance gaps in existing memory systems (e.g., RAG, long-context LLMs, and agentic memory) and the reasons behind them. AMemGym not only enables effective selection among competing approaches but can also potentially drive the self-evolution of memory management strategies. By bridging structured state evolution with free-form interactions, our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.

Paper Structure (57 sections, 14 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Benchmarks for agent memory evaluation.
Interactive agent evaluation by user simulation.
AMemGym
Structured Data Generation for On-Policy Interaction
On-policy Interaction
Evaluation Metrics
Meta-Evaluation
Memory Evaluation with AMemGym
Evaluation Setup
On-policy versus Off-policy Evaluation
Evaluation on Native LLMs and Agents
Diagnosis on Memory Agents
Can Memory Agents Self-Evolve Through Interaction?
...and 42 more sections

Figures (14)

Figure 1: On-policy vs. off-policy evaluation for assistants' memory.
Figure 2: An overview of the AMemGym framework.
Figure 3: An overview of diagnostic metrics: write, read, and utilization.
Figure 4: Memory implementations.
Figure 5: Evaluation on native LLMs. Both overall scores and normalized memory scores are shown.
...and 9 more figures
