Does GPT-4 pass the Turing test?
Cameron R. Jones, Benjamin K. Bergen
TL;DR
In a large public online Turing test, GPT-4 deceived interrogators under some prompts: its best-performing prompt reached a success rate (the share of games in which it was judged human) of about $0.50$, short of the human baseline of $0.66$. Interrogators' decisions were driven mainly by linguistic style and socioemotional traits rather than by intelligence narrowly conceived, and interrogators who knew more about LLMs were more accurate at detecting AI. The authors call for pre-registered, randomized studies with broader baselines and external tools to better gauge deception risks. The work underscores the continued relevance of the Turing test as an assessment of naturalistic communication, while cautioning against reading pass rates as evidence of human-level intelligence.
Abstract
We evaluated GPT-4 in a public online Turing test. The best-performing GPT-4 prompt passed in 49.7% of games, outperforming ELIZA (22%) and GPT-3.5 (20%), but falling short of the baseline set by human participants (66%). Participants' decisions were based mainly on linguistic style (35%) and socioemotional traits (27%), supporting the idea that intelligence, narrowly conceived, is not sufficient to pass the Turing test. Participant knowledge about LLMs and number of games played positively correlated with accuracy in detecting AI, suggesting learning and practice as possible strategies to mitigate deception. Despite known limitations as a test of intelligence, we argue that the Turing test continues to be relevant as an assessment of naturalistic communication and deception. AI models with the ability to masquerade as humans could have widespread societal consequences, and we analyse the effectiveness of different strategies and criteria for judging humanlikeness.
