DialSim: A Dialogue Simulator for Evaluating Long-Term Multi-Party Dialogue Understanding of Conversational Agents

Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yeonsu Kwon, Yohan Jo, Edward Choi

TL;DR

DialSim introduces a dialogue-simulation framework to evaluate long-term, multi-party dialogue understanding in conversational agents. It pairs this with LongDialQA, a large QA dataset derived from long-running TV scripts that combines fan-quiz and temporal knowledge-graph questions, plus anonymization and adversarial variants. Experiments across multiple LLMs and memory strategies reveal that current models struggle to maintain accurate comprehension over extended, interconnected dialogue histories, even with extended context or retrieval. The work highlights the need for more realistic benchmarks and improved memory-reasoning capabilities to advance long-term dialogue understanding in AI systems.

Abstract

Recent advancements in Large Language Models (LLMs) have significantly enhanced conversational agents, making them applicable to various fields (e.g., education, entertainment). Despite this progress, the evaluation of these agents often overlooks the complexities of real-world conversations, such as multi-party dialogues and extended contextual dependencies. To bridge this gap, we introduce DialSim, a dialogue simulation-based evaluation framework. In DialSim, an agent assumes the role of a character in a scripted conversation and is evaluated on its ability to answer spontaneous questions using only the dialogue history, while recognizing when it lacks sufficient information. To support this framework, we introduce LongDialQA, a new QA dataset constructed from long-running TV shows, comprising over 1,300 dialogue sessions, each paired with more than 1,000 carefully curated questions, totaling over 352,000 tokens. To minimize reliance on prior knowledge, all character names are anonymized or swapped. Our evaluation of state-of-the-art LLM-based conversational agents using DialSim reveals that even models with large context windows or RAG capabilities struggle to maintain accurate comprehension over long-term, multi-party interactions, underscoring the need for more realistic and challenging benchmarks in conversational AI.

Paper Structure (33 sections, 10 figures, 14 tables, 1 algorithm)

Figures (10)

  • Figure 1: The overall process of DialSim. Gray speech bubbles represent scripted utterances, and white speech bubbles indicate spontaneous questions asked during the simulation. Colored speech bubbles indicate the agent's responses to the questions. (Left) An unanswerable question. (Center) A long-term event recall question. (Right) A multi-hop question that requires understanding past sessions (i.e., the Left and Center boxes). The dialogue and questions are based on the Friends script, with character names anonymized (e.g., Ross $\rightarrow$ Robert). Each question is asked in the format chosen by the user, either multiple-choice or open-ended.
  • Figure 2: The overall process of question generation based on fan quizzes. First, we crawled fan quizzes from the web (1). Next, we applied filtering and revision processes to the crawled data (2). After that, we identified evidence scenes that could provide answers to the questions (3). From this, we created secondary versions of the questions by adding dates to each. We then mapped each question to the scenes by determining whether it is answerable in that scene or not (4). Finally, we applied character style transfer to make the questions more natural (5).
  • Figure 3: The overall process of question generation based on the temporal knowledge graph. We first extracted quadruples and constructed a temporal knowledge graph (1). Then, we generated questions based on this and mapped each question to the sessions by determining whether it was answerable in that session or not, similar to fan quiz-based questions (2). Character style transfer was performed afterwards (3).
  • Figure 4: The performance comparison between the oracle setting and the best memory management method.
  • Figure 5: The result of asking GPT-4o to explain Season 2, Episode 7 of Friends.
  • ...and 5 more figures
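
The Figure 3 caption describes extracting quadruples into a temporal knowledge graph and then marking each generated question as answerable or unanswerable in a given session. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a quadruple is (subject, relation, object, date) and that a question is answerable in a session once the fact's date is on or before the session date. The names `Quadruple`, `make_question`, `is_answerable`, and `reference_answer`, as well as the sample fact, are hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Quadruple:
    """One temporal fact: (subject, relation, object, date). Hypothetical schema."""
    subject: str
    relation: str
    obj: str
    when: date  # assumed: date of the session in which the fact is revealed


def make_question(quad: Quadruple) -> str:
    """Turn a quadruple into a templated, date-stamped question."""
    return f"Who was the {quad.relation} of {quad.subject} on {quad.when.isoformat()}?"


def is_answerable(quad: Quadruple, session_date: date) -> bool:
    """Assumed rule: answerable only once the fact has appeared in the history."""
    return quad.when <= session_date


def reference_answer(quad: Quadruple, session_date: date) -> str:
    """Gold answer for a session: the object, or an abstention if unanswerable."""
    return quad.obj if is_answerable(quad, session_date) else "I don't know"


# Toy fact, invented for this example (not taken from the dataset).
fact = Quadruple("Robert", "roommate", "Alex", date(1994, 10, 1))

for session in (date(1994, 9, 1), date(1995, 1, 1)):
    print(session, make_question(fact), "->", reference_answer(fact, session))
```

Under this assumed rule, the same question yields an abstention in sessions that precede the fact and the object name in later ones, which mirrors the answerable/unanswerable mapping the captions describe.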