StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns

Luanbo Wan, Weizhi Ma

TL;DR

StoryBench presents a dynamic, interactive-fiction–based benchmark to systematically evaluate long-term memory in large language models. By incorporating branching narratives and two task modes—Immediate Feedback and Self Recovery—the framework separately probes knowledge retention and sequential reasoning across extended multi-turn interactions. The accompanying dataset, built from The Invisible Guardian, enables controlled evaluation of long-horizon dependencies and multi-solution pathways. Experimental results across four models reveal notable gaps in memory robustness and adaptive reasoning, underscoring the need for memory-augmented architectures and more flexible evaluation protocols.

Abstract

Long-term memory (LTM) is essential for large language models (LLMs) to achieve autonomous intelligence in complex, evolving environments. Despite increasing efforts in memory-augmented and retrieval-based architectures, there remains a lack of standardized benchmarks to systematically evaluate LLMs' long-term memory abilities. Existing benchmarks still struggle to evaluate knowledge retention and dynamic sequential reasoning, and they are themselves inflexible; together these shortcomings limit their effectiveness in assessing models' LTM capabilities. To address these gaps, we propose a novel benchmark framework based on interactive fiction games, featuring dynamically branching storylines with complex reasoning structures. These structures simulate real-world scenarios by requiring LLMs to navigate hierarchical decision trees, where each choice triggers cascading dependencies across multi-turn interactions. Our benchmark emphasizes two distinct settings to test reasoning complexity: one with immediate feedback upon incorrect decisions, and the other requiring models to independently trace back and revise earlier choices after failure. As part of this benchmark, we also construct a new dataset designed to test LLMs' LTM within narrative-driven environments. We further validate the effectiveness of our approach through detailed experiments. Experimental results demonstrate the benchmark's ability to robustly and reliably assess LTM in LLMs.

Paper Structure

This paper contains 26 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Immediate Feedback. The model is informed immediately after each incorrect choice and prompted to retry until the correct option is selected.
  • Figure 2: Self Recovery. An incorrect choice leads to a failure ending either immediately or after several scenes. The model is then asked to identify the earliest point in the story where it believes the incorrect decision occurred and to attempt recovery from that point.
  • Figure 3: Four typical patterns illustrating dataset structure complexity.
  • Figure 4: Scene node example with character descriptions, dialogues, and other details.
  • Figure 5: Choice node example with choice text, branches, and other details.
  • ...and 3 more figures
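
The two task modes described in Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the story is reduced to a linear list of choices (the actual dataset is branching), and the names `Choice`, `fail_delay`, `policy`, `locate_error`, and both `run_*` functions are hypothetical. Here `policy` stands in for the model choosing an option, and `locate_error` stands in for the model naming the earliest choice it believes was wrong.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Choice:
    """Hypothetical toy choice node (not the paper's data format)."""
    options: List[str]
    correct: str          # the option that keeps the story on a successful path
    fail_delay: int = 0   # further choices played before a wrong pick ends the story


# policy(choice_index, options, options_to_avoid) -> chosen option
Policy = Callable[[int, List[str], List[str]], str]


def first_error(story: List[Choice], picks: List[str]) -> Optional[int]:
    """Index of the earliest wrong pick, or None if every pick so far is correct."""
    for i, pick in enumerate(picks):
        if pick != story[i].correct:
            return i
    return None


def run_immediate_feedback(story: List[Choice], policy: Policy,
                           max_attempts: int = 10) -> List[int]:
    """Immediate Feedback: retry each choice until correct; return wrong picks per choice."""
    wrong_counts = []
    for i, choice in enumerate(story):
        rejected: List[str] = []
        for _ in range(max_attempts):
            pick = policy(i, choice.options, rejected)
            if pick == choice.correct:
                break
            rejected.append(pick)  # the model is told this option was wrong
        else:
            raise RuntimeError(f"no correct pick at choice {i}")
        wrong_counts.append(len(rejected))
    return wrong_counts


def run_self_recovery(story: List[Choice], policy: Policy,
                      locate_error: Callable[[List[str]], int],
                      max_retries: int = 10) -> Optional[int]:
    """Self Recovery: play until a failure ending, roll back to the choice the
    model blames, and retry. Returns the number of failure endings before
    success, or None if max_retries is exceeded."""
    picks: List[str] = []
    avoid: List[List[str]] = [[] for _ in story]
    for retries in range(max_retries + 1):
        while len(picks) < len(story):
            i = len(picks)
            picks.append(policy(i, story[i].options, avoid[i]))
            err = first_error(story, picks)
            # The failure ending may surface only after fail_delay more choices.
            if err is not None and i >= err + story[err].fail_delay:
                break
        if first_error(story, picks) is None:
            return retries
        back_to = max(0, min(locate_error(picks), len(picks) - 1))
        avoid[back_to].append(picks[back_to])  # remember the pick being revised
        picks = picks[:back_to]                # resume from the blamed choice
    return None


def first_untried(index: int, options: List[str], avoid: List[str]) -> str:
    """Toy stand-in for a model: take the first option not yet ruled out."""
    for option in options:
        if option not in avoid:
            return option
    return options[0]


demo_story = [
    Choice(["left", "right"], correct="right", fail_delay=1),
    Choice(["wait", "run"], correct="wait"),
]
```

With `demo_story`, the toy policy makes one wrong pick at the first choice under Immediate Feedback. Under Self Recovery the same mistake only surfaces one choice later, so the run succeeds only if `locate_error` points back to the first choice rather than the point of failure; a locator that blames a later choice leaves the original error in place and the run fails again.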