Does GPT-4 pass the Turing test?

Cameron R. Jones, Benjamin K. Bergen

TL;DR

Does GPT-4 pass the Turing test? The paper reports that, in a large public online test, GPT-4 deceived interrogators under certain prompts, with the best prompt reaching a success rate (SR) of about $0.50$, short of the human baseline of $0.66$. Interrogators' decisions were driven mainly by linguistic style and socioemotional traits rather than by intelligence narrowly conceived, and interrogators who knew more about LLMs detected AI more accurately. The authors call for pre-registered, randomized studies with broader baselines and external tools to better gauge deception risks. The work underscores the continued relevance of naturalistic communication as a testbed for AI, while cautioning against overinterpreting pass rates as evidence of human-level intelligence.

Abstract

We evaluated GPT-4 in a public online Turing test. The best-performing GPT-4 prompt passed in 49.7% of games, outperforming ELIZA (22%) and GPT-3.5 (20%), but falling short of the baseline set by human participants (66%). Participants' decisions were based mainly on linguistic style (35%) and socioemotional traits (27%), supporting the idea that intelligence, narrowly conceived, is not sufficient to pass the Turing test. Participant knowledge about LLMs and number of games played positively correlated with accuracy in detecting AI, suggesting learning and practice as possible strategies to mitigate deception. Despite known limitations as a test of intelligence, we argue that the Turing test continues to be relevant as an assessment of naturalistic communication and deception. AI models with the ability to masquerade as humans could have widespread societal consequences, and we analyse the effectiveness of different strategies and criteria for judging humanlikeness.
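The percentages above are success rates. As a minimal sketch, assuming a witness's success rate is the fraction of its games in which the interrogator's verdict was "human", the calculation might look like the following. The `Game` record, the `success_rate` helper, and the toy data are hypothetical illustrations, not the authors' code or results.

```python
from dataclasses import dataclass


@dataclass
class Game:
    """One game record (hypothetical structure, not from the paper)."""
    witness: str         # label of the witness, e.g. a prompt name
    verdict_human: bool  # True if the interrogator judged the witness to be human


def success_rate(games, witness):
    """Fraction of the witness's games in which it was judged human."""
    relevant = [g for g in games if g.witness == witness]
    if not relevant:
        raise ValueError(f"no games recorded for witness {witness!r}")
    return sum(g.verdict_human for g in relevant) / len(relevant)


# Toy data for illustration only; these are NOT results from the study.
games = [
    Game("A", True),
    Game("A", False),
    Game("A", True),
    Game("A", False),
    Game("B", False),
]

print(success_rate(games, "A"))  # 0.5
print(success_rate(games, "B"))  # 0.0
```

Under this assumed definition, a human witness succeeds by being recognised as human, while an AI witness succeeds by being mistaken for one, so the same function covers both kinds of witness.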

Paper Structure (37 sections, 5 equations, 21 figures, 3 tables)

Figures (21)

  • Figure 1: Chat interface for the Turing test experiment featuring an example conversation between a human Interrogator (in green) and GPT-4.
  • Figure 2: Turing test Success Rate (SR) for a subset of witnesses. Human witnesses performed best with 66% SR. GPT-4 SR varied greatly by prompt from 50% (Dragon) to 6% (India). ELIZA achieved 22%, outperforming the best GPT-3.5 prompt (November, 20%), and the GPT-4 AI21 baseline prompt (21%).
  • Figure 3: The best-performing prompt, Dragon, used to instruct LLMs on how to respond to users.
  • Figure 4: Four example extracts from game conversations. Interrogators' messages are on the right (green). Footers contain the verdict, confidence, and justification given by the interrogator, and the true identity of the witness.
  • Figure 5: Interrogator accuracy in deciding whether the witness was human or an AI was positively correlated with knowledge about LLMs and number of games played, but not education or frequency of chatbot interaction.
  • ...and 16 more figures