X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents

Weiqi Wu; Hongqiu Wu; Hai Zhao

TL;DR

X-Turing redefines the Turing test for long-term dialogue by introducing burst dialogue, pseudo-dialogue generation, and the X-Turn Pass Rate, enabling scalable evaluation of LLM-based agents with reduced human workload. The framework deploys chatbots built from real social dialogue histories, uses iteratively generated pseudo-dialogues to simulate extended interactions, and pairs human-machine conversations with human-human reference conversations that judges evaluate via questionnaires. Experiments with GPT-4, Claude-3-Sonnet, and Qwen-110B show GPT-4 to be the most human-like over short dialogues, but all models become substantially less human-like as dialogue length increases, especially under burst conditions. The work demonstrates a practical, data-efficient approach to long-term evaluation of dialogue agents and identifies key factors that influence outcomes: dialogue history, topic coverage, and judge modality.

Abstract

The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes X-Turing, which enhances the original test with a burst dialogue pattern, allowing more dynamic exchanges using consecutive messages. It further reduces human workload by iteratively generating dialogues that simulate the long-term interaction between the agent and a human, and these generated dialogues make up the majority of the test process. With this pseudo-dialogue as history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic and judged using questionnaires. We introduce the X-Turn Pass Rate metric to assess the human-likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% in 3-turn and 10-turn dialogues respectively, their performance drops as the dialogue progresses, which underscores the difficulty of maintaining consistency in the long term.
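The abstract names the X-Turn Pass Rate but does not define it here, so the following is only a minimal sketch of how such a metric might be computed. It assumes that a "pass" means a judge failed to identify the machine in a paired comparison, and that the rate is the fraction of passes among all judgments for a given model and dialogue length X. The `Judgment` record, its field names, and the toy verdicts are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One judge verdict on one paired comparison (hypothetical record)."""
    model: str                # name of the model under test
    turns: int                # X: dialogue length in turns
    machine_identified: bool  # True if the judge correctly picked out the machine


def x_turn_pass_rate(judgments):
    """Return {(model, turns): pass rate}.

    Assumption: a pass is a judgment in which the machine was NOT identified.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for j in judgments:
        key = (j.model, j.turns)
        total[key] += 1
        if not j.machine_identified:
            passed[key] += 1
    return {key: passed[key] / total[key] for key in total}


# Toy verdicts, invented for illustration only (not data from the paper).
toy = [
    Judgment("model-a", 3, False),
    Judgment("model-a", 3, True),
    Judgment("model-a", 10, True),
    Judgment("model-a", 10, True),
    Judgment("model-a", 10, False),
    Judgment("model-a", 10, True),
]
rates = x_turn_pass_rate(toy)
print(rates)
```

Grouping by both model and X keeps the rates for different dialogue lengths side by side, which is how the reported 3-turn and 10-turn figures are compared.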

Paper Structure (31 sections, 2 equations, 8 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Turing Test
Role-play with LLMs
Dialogue Generation
X-Turing
Burst Dialogue vs. Ping-pong Dialogue
Chatbot Construction
Pseudo-Dialogue Generation
Experiments
Test Setup
Human Judges
LLM Judges
Metric: X-Turn Pass Rate
Results
...and 16 more sections

Figures (8)

Figure 1: Overview of X-Turing. Prompted by the dialogue history of a target person (Human 1), the LLM chats with another human (Human 2) after a specified number of pseudo-dialogue turns, which simulates its long-term interaction performance. The judges then distinguish between the LLM and Human 1 when they each converse with Human 2.
Figure 2: Chatbot system enabling burst dialogue.
Figure 3: Pipeline of Pseudo-Dialogue Generation.
Figure 4: Topics covered in the test.
Figure 5: Participant distribution by age, education level, and AI knowledge, with each group's average test accuracy in the 10-turn Turing test.
...and 3 more figures
