Large Language Models Pass the Turing Test

Cameron R. Jones, Benjamin K. Bergen

TL;DR

This study tests whether contemporary large language models can pass the classic three-party Turing test in randomized, controlled, pre-registered experiments on two independent populations. It compares GPT-4.5, LLaMa-3.1-405B, GPT-4o, and ELIZA as witnesses, varying the prompt between NO-PERSONA and PERSONA conditions to assess how much prompting matters. The key finding is that GPT-4.5-PERSONA achieves a 73% win rate, significantly above chance, and so passes the test in this setting. LLaMa-3.1-405B with the same persona prompt reaches 56%, which is not significantly different from chance, while the baseline systems ELIZA and GPT-4o fall significantly below chance (23% and 21% respectively). The results highlight the critical role of prompting in Turing-test outcomes, raise questions about what the test measures (humanlikeness versus intelligence), and underscore the potential social and economic consequences of AI systems that can pass as human in everyday interactions.

Abstract

We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5-minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans it was being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.
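
To make "significantly above chance" concrete, here is a minimal sketch of an exact two-sided binomial test of a win rate against the 50% chance level. This excerpt does not describe the paper's own statistical procedure, so the test is a stand-in. The function name `binomial_two_sided_p` and the count of 100 games per witness are hypothetical, chosen only so the win counts line up with the reported percentages; they are not the study's data.

```python
from math import comb


def binomial_two_sided_p(wins: int, games: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value against a fixed chance level.

    Sums the probability of every outcome that is no more likely
    than the observed number of wins.
    """
    pmf = [
        comb(games, k) * chance**k * (1 - chance) ** (games - k)
        for k in range(games + 1)
    ]
    observed = pmf[wins]
    # Small tolerance so ties with the observed probability are included.
    tail = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical counts: 100 games per witness (NOT the study's sample sizes).
for label, wins in [("73% win rate", 73), ("56% win rate", 56), ("21% win rate", 21)]:
    p = binomial_two_sided_p(wins, 100)
    verdict = "differs from chance" if p < 0.05 else "not distinguishable from chance"
    print(f"{label}: p = {p:.4g} ({verdict})")
```

With these illustrative counts, a 56% win rate cannot be distinguished from chance at the 0.05 level, while 73% and 21% both can. That matches the pattern the abstract describes, although the real p-values depend on the true number of games.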

Paper Structure

This paper contains 20 sections, 1 equation, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Four example games from the Prolific (a, b & d), and Undergraduate (c) studies. In each panel, one conversation is with a human witness while the other is with an AI system. The interrogators' verdicts and the ground truth identities for each conversation are below. A version of the experiment can be accessed at https://turingtest.live.
  • Figure 2: Left: Win rates for each AI witness: the proportion of the time that the interrogator judged the AI system to be human rather than the actual human witness. Error bars represent 95% bootstrap confidence intervals. Asterisks next to each bar indicate whether the win rate was significantly different from chance (50%). Right: Confidence in verdicts where the interrogator selected the actual human or the AI model for each witness type. Each point represents a single game. Points further toward the left and right indicate higher confidence that the AI is the AI versus the human respectively. Error bars indicate 95% bootstrap confidence intervals around the mean.
  • Figure 3: Interrogator accuracy against exit survey responses. Accuracy is the proportion of the time that interrogators correctly identified the human witness. In the Undergraduate study, participants' self-report of their accuracy was positively correlated with their real accuracy, but this was not true in the Prolific study. In the Prolific group, there were significant effects of gender, the number of games an interrogator had completed, and the interrogator's self-reported estimate of how intelligent AI is, but none of these effects were significant in the Undergraduate study. There were no significant effects of any of the remaining variables in either group.
  • Figure 4: Classification of strategies employed by interrogators by proportion of games (left) and mean accuracy of games where strategies were deployed with 95% confidence intervals (right). Participants often engaged in small talk, asking witnesses about their personal details, activities, or opinions. Interrogators who said unusual things or used typical LLM "jailbreaks" were the most accurate.
  • Figure 5: Proportion of interrogator reasons (left) and mean accuracy of verdicts that cited specific reasons with 95% confidence intervals (right). Interrogators were much more likely to cite linguistic style, conversational flow, and socio-emotional factors such as personality, rather than factors more traditionally associated with intelligence, such as knowledge and reasoning. The most accurate verdicts focussed on witnesses' directness in handling questions as well as instances where they lacked knowledge.
  • ...and 16 more figures
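
The Figure 2 caption refers to 95% bootstrap confidence intervals around win rates. The following is a minimal sketch of a percentile bootstrap for a win rate, assuming each game is coded as 1 when the interrogator judged the AI to be human and 0 otherwise. The function name `bootstrap_ci`, the resample count, and the example data are hypothetical; the paper's exact bootstrap settings are not given in this excerpt.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample games with replacement and record each resample's win rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical data: 73 wins in 100 games (NOT the study's sample size).
games = [1] * 73 + [0] * 27
low, high = bootstrap_ci(games)
print(f"win rate = {sum(games) / len(games):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

An interval that excludes 0.5 corresponds to a win rate that differs from chance, which is what the asterisks in Figure 2 mark.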