DialSim: A Dialogue Simulator for Evaluating Long-Term Multi-Party Dialogue Understanding of Conversational Agents
Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yeonsu Kwon, Yohan Jo, Edward Choi
TL;DR
DialSim introduces a dialogue-simulation framework to evaluate long-term, multi-party dialogue understanding in conversational agents. It pairs this with LongDialQA, a large QA dataset derived from long-running TV scripts that combines fan-quiz and temporal knowledge-graph questions, plus anonymization and adversarial variants. Experiments across multiple LLMs and memory strategies reveal that current models struggle to maintain accurate comprehension over extended, interconnected dialogue histories, even with extended context or retrieval. The work highlights the need for more realistic benchmarks and improved memory-reasoning capabilities to advance long-term dialogue understanding in AI systems.
Abstract
Recent advancements in Large Language Models (LLMs) have significantly enhanced conversational agents, making them applicable to various fields (e.g., education, entertainment). Despite this progress, the evaluation of these agents often overlooks the complexities of real-world conversations, such as multi-party dialogues and extended contextual dependencies. To bridge this gap, we introduce DialSim, a dialogue simulation-based evaluation framework. In DialSim, an agent assumes the role of a character in a scripted conversation and is evaluated on its ability to answer spontaneous questions using only the dialogue history, while recognizing when it lacks sufficient information. To support this framework, we introduce LongDialQA, a new QA dataset constructed from long-running TV shows, comprising over 1,300 dialogue sessions, each paired with more than 1,000 carefully curated questions, totaling over 352,000 tokens. To minimize reliance on prior knowledge, all character names are anonymized or swapped. Our evaluation of state-of-the-art LLM-based conversational agents using DialSim reveals that even models with large context windows or RAG capabilities struggle to maintain accurate comprehension over long-term, multi-party interactions, underscoring the need for more realistic and challenging benchmarks in conversational AI.
