CAST: Character-and-Scene Episodic Memory for Agents

Kexin Ma, Bojun Li, Yuhua Tang, Ruochun Jin, Liting Sun

TL;DR

CAST introduces a dual-memory system for agents by structuring episodic memory into character-centered scenes grounded in time, place, and action, while maintaining a semantic memory graph. Scenes are derived from short views via greedy 3D clustering and compiled into character profiles that track events across time. An offline index builds both semantic and episodic memories, and online query processing retrieves and fuses evidence from both memories to answer questions with improved episodic correctness. Experiments on LOCOMO and epbench show CAST achieving consistent gains, particularly on open-domain and time-sensitive queries, demonstrating robustness over baselines. The work highlights a practical, narrative-inspired approach to memory for agents and suggests directions for learnable scene boundaries and memory compression.

Abstract

Episodic memory, the ability to recall coherent events grounded in who, when, and where, is a central component of human memory. However, most agent memory systems emphasize only semantic recall and store experience in structures such as key-value stores, vectors, or graphs, which makes it hard for them to represent and retrieve coherent events. To address this challenge, we propose CAST, a Character-and-Scene based memory architecture inspired by dramatic theory. Specifically, CAST constructs 3D scenes (time/place/topic) and organizes them into character profiles, each summarizing the events of one character, to represent episodic memory. Moreover, CAST complements this episodic memory with a graph-based semantic memory, yielding a robust dual-memory design. Experiments demonstrate that CAST improves over baselines by 8.11% F1 and 10.21% J (LLM-as-a-Judge) on average across various datasets, especially on open and time-sensitive conversational questions.
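
To make the scene and character-profile idea concrete, here is a minimal Python sketch of greedily grouping short views into scenes along the three dimensions (time/place/topic) and then indexing scenes by character. It is an illustration only, not the paper's algorithm: the `View` and `Scene` types, the `max_gap` threshold, and the exact-match test on place and topic are all assumptions made for this example, since the summary does not specify the distance measure or merge rule CAST uses.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class View:
    """A short observation; all field names here are hypothetical."""
    character: str
    time: float   # hours since the start of the conversation (assumed unit)
    place: str
    topic: str
    text: str


@dataclass
class Scene:
    views: list = field(default_factory=list)

    def accepts(self, view: View, max_gap: float) -> bool:
        # Assumed merge rule: close in time, same place, same topic.
        last = self.views[-1]
        return (
            view.time - last.time <= max_gap
            and view.place == last.place
            and view.topic == last.topic
        )


def greedy_scenes(views, max_gap=2.0):
    """Attach each view (in time order) to the first matching scene, else open a new one."""
    scenes = []
    for view in sorted(views, key=lambda v: v.time):
        for scene in scenes:
            if scene.accepts(view, max_gap):
                scene.views.append(view)
                break
        else:
            scenes.append(Scene([view]))
    return scenes


def character_profiles(scenes):
    """Map each character to the indices of the scenes they appear in, in time order."""
    profiles = defaultdict(list)
    for idx, scene in enumerate(scenes):
        for name in {v.character for v in scene.views}:
            profiles[name].append(idx)
    return dict(profiles)


views = [
    View("Alice", 0.0, "cafe", "travel", "Alice plans a trip."),
    View("Bob", 0.5, "cafe", "travel", "Bob suggests Lisbon."),
    View("Alice", 1.0, "office", "work", "Alice files a report."),
    View("Alice", 9.0, "cafe", "travel", "Alice books flights."),
]
scenes = greedy_scenes(views)
profiles = character_profiles(scenes)
```

In this toy run the last view is not merged into the first cafe scene because it falls outside the assumed time gap, so a character profile keeps two separate travel episodes instead of one merged fact.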

Paper Structure (50 sections, 5 figures, 7 tables, 2 algorithms)


Figures (5)

  • Figure 1: Comparison between the flat memory of current LLM agents (left) and our CAST inspired by human cognition and dramatic theory (right).
  • Figure 2: Example of scene aggregation.
  • Figure 3: The overview of CAST.
  • Figure 4: Analysis experiments of CAST: (a) results for view parameters; (b) and (d) results for scene aggregation parameters; (c) results for retrieval parameters.
  • Figure 5: The ablation experiments of CAST.