People cannot distinguish GPT-4 from a human in a Turing test

Cameron R. Jones, Benjamin K. Bergen

TL;DR

This study empirically tests whether GPT-4 can pass a two-player Turing test in short text conversations, using GPT-4, GPT-3.5, and ELIZA as AI witnesses alongside a human comparator. In a preregistered, randomized design with 500 participants, GPT-4 was judged to be human 54% of the time, outperforming ELIZA (22%) but lagging behind human witnesses (67%), and showing no significant advantage over GPT-3.5. The results suggest that current AI can be mistaken for humans in interactive settings, which highlights deception risks and indicates that interrogators rely more on linguistic style and socio-emotional cues than on traditional notions of intelligence. Analyses of interrogators' strategies and stated reasons show that social factors predominantly drive judgments, a finding relevant both to mitigating deception and to assessing AI in real-world communication.

Abstract

We evaluated 3 systems (ELIZA, GPT-3.5 and GPT-4) in a randomized, controlled, and preregistered Turing test. Human participants had a 5 minute conversation with either a human or an AI, and judged whether or not they thought their interlocutor was human. GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%). The results provide the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test. The results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected. Analysis of participants' strategies and reasoning suggests that stylistic and socio-emotional factors play a larger role in passing the Turing test than traditional notions of intelligence.

Paper Structure (21 sections, 1 equation, 13 figures, 3 tables)

Figures (13)

  • Figure 1: A selection of conversations between human interrogators (green) and witnesses (grey). One of these four conversations is with a human witness, the rest are with AI. Interrogator verdicts and ground truth identities are below (to allow readers to indirectly participate).
  • Figure 2: Pass rates (left) and interrogator confidence (right) for each witness type. Pass rates are the proportion of the time a witness type was judged to be human. Error bars represent 95% bootstrap confidence intervals. Significance stars above each bar indicate whether the pass rate was significantly different from 50%. Comparisons show significant differences in pass rates between witness types. Right: Confidence in human and AI judgements for each witness type. Each point represents a single game. Points further toward the left and right indicate higher confidence in AI and human verdicts respectively.
  • Figure 3: Classification of strategies employed by interrogators by proportion of games (left) and mean accuracy of games where strategies were deployed (right). Participants often engaged in small talk, asking witnesses about their personal details, activities, or opinions. Interrogators who asked about logic, current events, or human emotions and experiences tended to be more accurate.
  • Figure 4: Proportion of interrogator reasons for AI verdicts (left) and Human verdicts (right), excluding ELIZA games. In both cases, interrogators were much more likely to cite linguistic style or socio-emotional factors such as personality, rather than factors more traditionally associated with intelligence, such as knowledge and reasoning.
  • Figure 5: Turing test game interface. Left: an in-progress conversation between an interrogator (green) and a witness (grey). The timer at the top shows time remaining in the game. Right: the decision interface the interrogator uses to give their verdict.
  • ...and 8 more figures
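
The Figure 2 caption describes the error bars as 95% bootstrap confidence intervals on each pass rate. The sketch below shows one common way to compute such an interval, a percentile bootstrap over per-game verdicts. It is an illustration and not the authors' analysis code: the function name `bootstrap_pass_rate_ci`, the number of resamples, and the example data (100 hypothetical games, 54 judged human) are all assumptions, since the per-game data is not given here.

```python
import random


def bootstrap_pass_rate_ci(verdicts, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the proportion of games judged human.

    verdicts: list of 0/1 values, 1 meaning the witness was judged human.
    Returns (observed_rate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(verdicts)
    rates = []
    for _ in range(n_resamples):
        # Resample games with replacement and recompute the pass rate.
        resample = rng.choices(verdicts, k=n)
        rates.append(sum(resample) / n)
    rates.sort()
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(verdicts) / n, lower, upper


# Hypothetical data: 100 games, 54 judged human (the game count is assumed).
verdicts = [1] * 54 + [0] * 46
rate, lower, upper = bootstrap_pass_rate_ci(verdicts)
print(f"pass rate = {rate:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

If the resulting interval contains 0.5, the pass rate cannot be distinguished from chance at that level, which is the comparison the significance stars in Figure 2 refer to.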