If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs

Siqi Fan; Xiusheng Huang; Yiqun Yao; Xuezhi Fang; Kang Liu; Peng Han; Shuo Shang; Aixin Sun; Yequan Wang

TL;DR

This work addresses the challenge of measuring lifelong learning in LLMs by treating state evolution as a function of cumulative, multi-agent experiences. It introduces LifeState-Bench, a benchmark with Hamlet and synthetic episodic narratives that encode explicit timelines, scene details, and multi-character interactions, coupled with a fact-checking framework focused on self-awareness, episodic memory retrieval, and relationship shifts. The results indicate non-parametric memory approaches better preserve long-term context, while all tested models exhibit catastrophic forgetting across episodes, highlighting significant room for improvement in stateful LLM capabilities. Overall, LifeState-Bench provides a diagnostic platform for understanding and advancing long-horizon reasoning and memory in LLMs, guiding future lifelong-learning research.

Abstract

Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, focusing primarily on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets, Hamlet and a synthetic script collection, both rich in narrative structure and character interactions. Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking across both parametric and non-parametric approaches. In experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that non-parametric methods significantly outperform parametric ones in managing stateful learning. However, all models struggle with catastrophic forgetting as interactions extend, highlighting the need for further advances in lifelong learning.
