Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models

David Castillo-Bolado; Joseph Davidson; Finlay Gray; Marek Rosa

TL;DR

This work proposes the Long-Term Memory (LTM) Benchmark, a dynamic, single-conversation framework designed to evaluate memory, continual learning, and information integration in conversational agents. By interleaving multiple tasks within one dialogue and using a deterministic data-generation process plus a scheduling system, the benchmark exposes limitations of current LLMs, especially under task-switching and memory constraints. Results show that vanilla large-context LLMs struggle as memory spans exceed their context windows, whereas LTM-augmented or shorter-context models with external memory maintain robustness, suggesting a focusing effect from external memory. The benchmark, which is open-source, aims to drive development toward agents capable of sustained, coherent behavior in realistic, multi-task conversational settings and highlights the need for more ecologically valid evaluation of memory and learning capabilities.

Abstract

We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user$\leftrightarrow$agent interaction. The interaction is a conversation between the user and the agent in which multiple tasks are introduced and then undertaken concurrently. We switch context regularly to interleave the tasks, which creates a realistic testing scenario in which we assess the Long-Term Memory, Continual Learning, and Information Integration capabilities of the agents. Results from both proprietary and open-source Large Language Models (LLMs) show that LLMs generally perform well on single-task interactions but struggle on the same tasks when they are interleaved. Notably, short-context LLMs supplemented with an LTM system perform as well as or better than those with larger contexts. Our benchmark suggests that more natural interactions pose challenges for LLMs that contemporary benchmarks have so far been unable to capture.
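To make the idea of interleaving concrete, the following is a minimal sketch of how several tasks might be merged into one conversation while each task keeps its own message order. It is an illustration only and not the benchmark's actual scheduler: the function name `interleave_tasks`, the seeded random choice, and the toy task contents are all assumptions introduced here.

```python
import random
from typing import Dict, List, Tuple


def interleave_tasks(
    tasks: Dict[str, List[str]], seed: int = 0
) -> List[Tuple[str, str]]:
    """Merge per-task message lists into a single conversation.

    Hypothetical helper: each task's messages keep their relative order,
    but messages from different tasks are mixed. A fixed seed makes the
    resulting schedule reproducible.
    """
    rng = random.Random(seed)
    cursors = {name: 0 for name in tasks}  # next unsent message per task
    schedule: List[Tuple[str, str]] = []
    while True:
        # Tasks that still have messages left, sorted for determinism.
        pending = sorted(n for n, i in cursors.items() if i < len(tasks[n]))
        if not pending:
            break
        name = rng.choice(pending)  # context switch to some pending task
        schedule.append((name, tasks[name][cursors[name]]))
        cursors[name] += 1
    return schedule


if __name__ == "__main__":
    # Toy tasks (assumed for illustration): information first, question last.
    demo = {
        "colours": ["My favourite colour is blue.", "What is my favourite colour?"],
        "shopping": ["Add milk to my list.", "Add eggs to my list.", "What is on my list?"],
    }
    for task_name, message in interleave_tasks(demo, seed=42):
        print(f"[{task_name}] {message}")
```

Because the order within each task is preserved, a question always arrives after the information it depends on, while unrelated messages from other tasks may land in between and lengthen the span the agent has to remember.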

Paper Structure (53 sections, 1 equation, 36 figures, 1 table)

Introduction
Definitions and Terms
The LTM benchmark
Test structure
Test scenarios
Generation processes
LTM Implementations
Results
Analysis
Other benchmarks
Limitations and future work
Conclusions
Societal Impacts and Ethical Considerations
Data Generation, Collection, Scoring
Colours
...and 38 more sections

Figures (36)

Figure 1: Standard one-shot evaluations focus on building a challenging prompt and evaluating the LLM's response. In a conversation there are many LLM calls involved and the prompt grows monotonically, and the task can either stay constant or be switched regularly.
Figure 2: (left) Outline of a test's structure as part of the entire benchmark conversation. Needles and questions are messages spread throughout the conversation, aiming to take up as much space as the memory span allows. (right) Zooming in, we can see how the tests are intertwined, and how different questions refer to distinct pieces of information.
Figure 3: Example beginning of a benchmark. The benchmark first sets the scene for the agent and then proceeds by interleaving test messages.
Figure 4: Example of the dummy task based on the TriviaQA dataset.
Figure 5: Scores obtained for all agent and memory span configurations. Solid boxes correspond to vanilla LLMs, while strided boxes represent LTM agents. Each color can be associated with a different LLM. Context sizes for each of the agents vary, and are detailed in Table \ref{tab:results}.
...and 31 more figures
