CAST: Character-and-Scene Episodic Memory for Agents
Kexin Ma, Bojun Li, Yuhua Tang, Ruochun Jin, Liting Sun
TL;DR
CAST introduces a dual-memory system for agents: episodic memory is structured into character-centered scenes grounded in time, place, and action, and is paired with a semantic memory graph. Scenes are derived from short views via greedy 3D clustering and compiled into character profiles that track events across time. An offline indexing stage builds both the semantic and episodic memories, and online query processing retrieves and fuses evidence from both to answer questions with improved episodic correctness. Experiments on LOCOMO and epbench show that CAST achieves consistent gains over baselines, particularly on open-domain and time-sensitive queries. The work highlights a practical, narrative-inspired approach to memory for agents and suggests directions for learnable scene boundaries and memory compression.
Abstract
Episodic memory, the ability to recall coherent events grounded in who, when, and where, is a central component of human memory. However, most agent memory systems emphasize only semantic recall and store experience in structures such as key-value stores, vectors, or graphs, which makes it hard for them to represent and retrieve coherent events. To address this challenge, we propose CAST, a Character-and-Scene based memory architecture inspired by dramatic theory. Specifically, CAST constructs 3D scenes (time/place/topic) and organizes them into character profiles, each summarizing the events involving one character, to represent episodic memory. Moreover, CAST complements this episodic memory with a graph-based semantic memory, yielding a robust dual-memory design. Experiments show that CAST improves over baselines by 8.11% F1 and 10.21% J (LLM-as-a-Judge) on average across various datasets, especially on open-domain and time-sensitive conversational questions.
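
To make the scene-and-profile idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that each observation ("view") carries a time, place, and topic, that a view joins an existing scene when place and topic match exactly and the time gap is below a threshold, and that a character profile is simply that character's scenes in time order. The names View, Scene, greedy_scenes, character_profiles, and max_gap are all hypothetical, and the matching rule is a stand-in for whatever similarity measure CAST actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class View:
    """A short observation. All field names are hypothetical."""
    character: str
    time: float   # e.g. hours since the start of the conversation
    place: str
    topic: str
    text: str


@dataclass
class Scene:
    views: list = field(default_factory=list)

    def accepts(self, view, max_gap):
        # Assumed rule: same place, same topic, and close in time
        # to the most recent view already in this scene.
        last = self.views[-1]
        return (
            view.place == last.place
            and view.topic == last.topic
            and abs(view.time - last.time) <= max_gap
        )


def greedy_scenes(views, max_gap=1.0):
    """Visit views in time order; put each into the first scene that accepts it."""
    scenes = []
    for view in sorted(views, key=lambda v: v.time):
        for scene in scenes:
            if scene.accepts(view, max_gap):
                scene.views.append(view)
                break
        else:
            # No existing scene matched, so this view opens a new scene.
            scenes.append(Scene(views=[view]))
    return scenes


def character_profiles(scenes):
    """Map each character to the scenes they appear in, ordered by scene start time."""
    profiles = {}
    for scene in scenes:
        for name in {v.character for v in scene.views}:
            profiles.setdefault(name, []).append(scene)
    for timeline in profiles.values():
        timeline.sort(key=lambda s: s.views[0].time)
    return profiles
```

A greedy single pass keeps the sketch simple and deterministic; a real system would likely replace the exact place and topic match with a learned or embedding-based similarity.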
