REMem: Reasoning with Episodic Memory in Language Agent

Yiheng Shu, Saisri Padmaja Jonnalagedda, Xiang Gao, Bernal Jiménez Gutiérrez, Weijian Qi, Kamalika Das, Huan Sun, Yu Su

TL;DR

REMem is a two-phase framework for constructing and reasoning with episodic memory. It substantially outperforms state-of-the-art memory systems such as Mem0 and HippoRAG 2, with absolute improvements of 3.4% on episodic recollection tasks and 13.4% on episodic reasoning tasks.

Abstract

Humans excel at remembering concrete experiences along spatiotemporal contexts and performing reasoning across those events, i.e., the capacity for episodic memory. In contrast, memory in language agents remains mainly semantic, and current agents are not yet capable of effectively recollecting and reasoning over interaction histories. We identify and formalize the core challenges of episodic recollection and reasoning from this gap, and observe that existing work often overlooks episodicity, lacks explicit event modeling, or overemphasizes simple retrieval rather than complex reasoning. We present REMem, a two-phase framework for constructing and reasoning with episodic memory: 1) Offline indexing, where REMem converts experiences into a hybrid memory graph that flexibly links time-aware gists and facts. 2) Online inference, where REMem employs an agentic retriever with carefully curated tools for iterative retrieval over the memory graph. Comprehensive evaluation across four episodic memory benchmarks shows that REMem substantially outperforms state-of-the-art memory systems such as Mem0 and HippoRAG 2, showing 3.4% and 13.4% absolute improvements on episodic recollection and reasoning tasks, respectively. Moreover, REMem also demonstrates more robust refusal behavior for unanswerable questions.
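
As a rough illustration of the idea of a hybrid memory that links time-aware gists and facts, the following minimal Python sketch stores event gists and time-scoped triples and exposes two toy lookup "tools". It is not the authors' implementation: every name here (`Gist`, `Fact`, `HybridMemoryGraph`, `facts_about`, `gists_between`), the schema, and the example data are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Gist:
    """A short, time-stamped summary of one event (hypothetical schema)."""
    gist_id: str
    text: str
    when: date


@dataclass(frozen=True)
class Fact:
    """A (subject, relation, object) triple with an optional validity window."""
    subject: str
    relation: str
    obj: str
    start: Optional[date] = None       # None means "no known start"
    end: Optional[date] = None         # None means "still valid"
    source_gist: Optional[str] = None  # link back to the gist it came from


@dataclass
class HybridMemoryGraph:
    """Toy container linking gists and facts; an assumption, not REMem's API."""
    gists: dict = field(default_factory=dict)
    facts: list = field(default_factory=list)

    def add_gist(self, gist: Gist) -> None:
        self.gists[gist.gist_id] = gist

    def add_fact(self, fact: Fact) -> None:
        self.facts.append(fact)

    def facts_about(self, entity: str, on: Optional[date] = None) -> list:
        """Return facts mentioning `entity`, optionally only those valid on `on`."""
        hits = []
        for f in self.facts:
            if entity not in (f.subject, f.obj):
                continue
            if on is not None:
                if f.start is not None and on < f.start:
                    continue
                if f.end is not None and on > f.end:
                    continue
            hits.append(f)
        return hits

    def gists_between(self, start: date, end: date) -> list:
        """Return gists dated within [start, end], oldest first."""
        in_range = [g for g in self.gists.values() if start <= g.when <= end]
        return sorted(in_range, key=lambda g: g.when)


# Made-up example data, for illustration only.
memory = HybridMemoryGraph()
memory.add_gist(Gist("g1", "Alice says she moved to Berlin for a new job.", date(2023, 5, 2)))
memory.add_gist(Gist("g2", "Alice mentions she has relocated to Lisbon.", date(2024, 1, 15)))
memory.add_fact(Fact("Alice", "lives_in", "Berlin", date(2023, 5, 2), date(2024, 1, 14), "g1"))
memory.add_fact(Fact("Alice", "lives_in", "Lisbon", date(2024, 1, 15), None, "g2"))

valid = memory.facts_about("Alice", on=date(2023, 9, 1))
print([f.obj for f in valid])  # prints ['Berlin']
```

In an agentic setting, a retriever could call lookups like these repeatedly, narrowing by entity or by time window until it has enough evidence to answer or to refuse. The abstract does not specify the actual tool set, so the two methods above are placeholders.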

Paper Structure (44 sections, 8 figures, 21 tables, 2 algorithms)

Figures (8)

  • Figure 1: Overview of evaluation on episodic memory. Utterances are grounded to a timeline (top). We evaluate two progressive capabilities and show average scores on each (bottom): 1) Episodic recollection: recollect temporal and other situational elements of past experiences, measured by LLM-as-a-judge scores on LoCoMo and REALTALK. 2) Episodic reasoning: multi-step reasoning across the timeline based on recollection, e.g., event-to-event relations, counting, and ordinal queries, measured by LLM-as-a-judge score on Complex-TR and the EM score on Test of Time.
  • Figure 2: Overview of REMem. The indexing phase turns utterances into time-aware memory by extracting event gists and time-scoped facts (triples) and organizing them as a hybrid graph. The agentic inference phase invokes carefully curated tools over this graph to surface the most relevant gists and facts for reasoning in an iterative manner.
  • Figure 3: The prompts for gist extraction. The instructions and demonstrations are marked in different colors.
  • Figure 4: The prompts for fact extraction. The instructions and demonstrations are marked in different colors.
  • Figure 5: The prompts for tool selection. The instructions and demonstrations are marked in different colors.
  • ...and 3 more figures