If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs

Siqi Fan, Xiusheng Huang, Yiqun Yao, Xuezhi Fang, Kang Liu, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang

TL;DR

This work addresses the challenge of measuring lifelong learning in LLMs by treating state evolution as a function of cumulative, multi-agent experiences. It introduces LifeState-Bench, a benchmark with Hamlet and synthetic episodic narratives that encode explicit timelines, scene details, and multi-character interactions, coupled with a fact-checking framework focused on self-awareness, episodic memory retrieval, and relationship shifts. The results indicate non-parametric memory approaches better preserve long-term context, while all tested models exhibit catastrophic forgetting across episodes, highlighting significant room for improvement in stateful LLM capabilities. Overall, LifeState-Bench provides a diagnostic platform for understanding and advancing long-horizon reasoning and memory in LLMs, guiding future lifelong-learning research.

Abstract

Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, primarily focusing on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets, Hamlet and a synthetic script collection, both rich in narrative structure and character interactions. Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking across both parametric and non-parametric approaches. In experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that non-parametric methods significantly outperform parametric ones in managing stateful learning. However, all models struggle with catastrophic forgetting as interactions extend, highlighting the need for further advancements in lifelong learning.

Paper Structure

This paper contains 43 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Dataset Statistics. Triangles represent role ability benchmarks, while circles denote dialogue agent benchmarks.
  • Figure 2: Method Overview. Our benchmark captures three key features: cumulative experience, fact-checking, and memory testing. The LLM judge scoring system is shown in the bottom-right corner.
  • Figure 3: Episode-wise Performance of Hamlet and Synthetic Datasets. This includes the overall performance of various methods, as well as performance from different state perspectives.