Large Language Models Pass the Turing Test
Cameron R. Jones, Benjamin K. Bergen
TL;DR
This study evaluates whether contemporary large language models can pass the classic three-party Turing test in randomized, controlled experiments on two independent populations. It compares GPT-4.5, LLaMa-3.1-405B, GPT-4o, and ELIZA as witnesses, and varies the prompt between NO-PERSONA and PERSONA conditions to assess how much the outcome depends on prompting. The key finding is that GPT-4.5-PERSONA achieves a 73% win rate, significantly above chance: interrogators picked it as the human more often than they picked the real human participant. LLaMa-3.1-405B with the same persona prompt reaches 56%, which is not significantly different from chance, while the baseline systems ELIZA and GPT-4o fall significantly below chance (23% and 21%, respectively). Models run without the persona prompt are judged human less often than their PERSONA counterparts. The results highlight the critical role of prompting in Turing-test outcomes, raise questions about what the test measures (humanlikeness versus intelligence), and underscore the potential social and economic consequences of AI systems that can pass as human in everyday interactions.
Abstract
We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5-minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans it was being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.
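
To make "significantly above chance" concrete, the sketch below runs an exact two-sided binomial test of a win rate against the 50% chance level. It is an illustration only, not necessarily the analysis the authors used. The abstract does not report the number of games per condition, so `HYPOTHETICAL_TRIALS = 100` and the win counts derived from it are assumptions, and the function name `binomial_two_sided_p` is ours.

```python
from math import comb


def binomial_two_sided_p(wins: int, trials: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than
    the observed number of wins under the null hypothesis `chance`.
    """
    pmf = [
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[wins]
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# ASSUMPTION: 100 games per condition; the abstract gives only percentages.
HYPOTHETICAL_TRIALS = 100

# Win rates taken from the abstract, converted to hypothetical counts.
reported_win_rates = {
    "GPT-4.5 (persona)": 73,
    "LLaMa-3.1 (persona)": 56,
    "ELIZA": 23,
    "GPT-4o": 21,
}

for system, wins in reported_win_rates.items():
    p_value = binomial_two_sided_p(wins, HYPOTHETICAL_TRIALS)
    verdict = "differs from chance" if p_value < 0.05 else "consistent with chance"
    print(f"{system}: {wins}/{HYPOTHETICAL_TRIALS} wins, p = {p_value:.3g} ({verdict})")
```

Under these assumed counts, a 56% win rate cannot be distinguished from 50%, while 73%, 23%, and 21% can. The conclusion for rates near 50% depends strongly on the actual sample size, which should be taken from the paper.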
