X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents
Weiqi Wu, Hongqiu Wu, Hai Zhao
TL;DR
X-Turing redefines the Turing test for long-term dialogue by introducing burst dialogue, pseudo-dialogue generation, and the $X$-Turn Pass-Rate, enabling scalable evaluation of LLM-based agents with reduced human workload. The framework deploys chatbots built from real social dialogue histories, uses iteratively generated pseudo-dialogues to simulate extended interactions, and pairs each human-machine conversation with a human-human reference conversation, both of which are judged via questionnaires. Experiments with GPT-4, Claude-3-Sonnet, and Qwen-110B show GPT-4 to be the most human-like over the first few turns, but all models become substantially less human-like as dialogue length increases, especially under burst conditions. The work demonstrates a practical, data-efficient approach to long-term evaluation of dialogue agents and identifies three key factors that influence outcomes: dialogue history, topic coverage, and judge modality.
Abstract
The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes \textbf{\textsc{X-Turing}}, which enhances the original test with a \textit{burst dialogue} pattern that allows more dynamic exchanges through consecutive messages. It further reduces human workload by iteratively generating \textit{pseudo-dialogues} that simulate long-term interaction between the agent and a human and that make up the majority of the test process. Given this pseudo-dialogue history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic and judged using questionnaires. We introduce the \textit{X-Turn Pass-Rate} metric to assess the human-likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9\% and 38.9\% in 3-turn and 10-turn dialogues, respectively, their performance drops as the dialogue progresses, underscoring the difficulty of maintaining consistency over the long term.
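
The abstract does not give a formal definition of the X-Turn Pass-Rate, so the following is only a minimal sketch under a stated assumption: that the pass rate at dialogue length X is the fraction of judged X-turn human-machine dialogues in which the judge took the agent for a human. The `Judgement` record, its field names, and the toy verdicts are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Judgement(NamedTuple):
    """One questionnaire verdict (hypothetical record layout)."""
    turns: int          # length X of the judged human-machine dialogue
    judged_human: bool  # True if the judge took the agent for a human


def x_turn_pass_rate(judgements: Iterable[Judgement]) -> Dict[int, float]:
    """Return, per dialogue length X, the share of verdicts that were 'human'.

    Assumption: pass rate = (# verdicts judged human) / (# verdicts) at each X.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for j in judgements:
        total[j.turns] += 1
        passed[j.turns] += int(j.judged_human)
    return {x: passed[x] / total[x] for x in sorted(total)}


# Made-up verdicts, purely for illustration (not results from the paper).
toy = [
    Judgement(3, True), Judgement(3, False), Judgement(3, True),
    Judgement(10, False), Judgement(10, False), Judgement(10, True),
]
rates = x_turn_pass_rate(toy)
for x, rate in rates.items():
    print(f"{x}-turn pass rate: {rate:.1%}")
```

Reporting the rate separately for each X, rather than pooling all verdicts, is what would make a decline over dialogue length visible, such as the drop between 3-turn and 10-turn dialogues reported for GPT-4.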
