GPT-4 is judged more human than humans in displaced and inverted Turing tests

Ishika Rathi, Sydney Taylor, Benjamin K. Bergen, Cameron R. Jones

TL;DR

This paper addresses the challenge of detecting AI-generated content in online conversations by comparing two non-interactive Turing-test variants: an inverted Turing test and a displaced Turing test. It takes transcripts from an interactive Turing test and evaluates both AI adjudicators (GPT-3.5, GPT-4) and displaced human judges, using linear mixed-effects analyses and curvature- and likelihood-based detectors. Both AI and displaced human adjudicators were less accurate than interactive interrogators, and all of them judged the best-performing GPT-4 witness to be human more often than they judged real human witnesses to be human. In-context learning can boost AI-adjudication accuracy to around 58%, and curvature-based methods reach about 69% accuracy, though practical deployment remains challenging. These results underscore substantial detection difficulties in real-world, non-interactive settings and motivate development of more robust AI-detection tools for online conversations.

Abstract

Everyday AI detection requires differentiating between people and AI in informal, online conversations. In many cases, people will not interact directly with AI systems but instead read conversations between AI systems and other people. We measured how well people and large language models can discriminate using two modified versions of the Turing test: inverted and displaced. GPT-3.5, GPT-4, and displaced human adjudicators judged whether an agent was human or AI on the basis of a Turing test transcript. We found that both AI and displaced human judges were less accurate than interactive interrogators, with below-chance accuracy overall. Moreover, all three judged the best-performing GPT-4 witness to be human more often than human witnesses. This suggests that both humans and current LLMs struggle to distinguish between the two when they are not actively interrogating the person, underscoring an urgent need for more accurate tools to detect AI in conversations.

Paper Structure (27 sections, 10 figures)

Figures (10)

  • Figure 1: A summary of our experimental design. Transcripts were sampled from an interactive Turing test, where a human judge interrogates a witness to determine if they are human or AI. In an inverted Turing test, we present transcripts to AI models, who judge whether the same witnesses are human or AI. In a displaced Turing test, a separate group of human participants read the same transcripts and make this judgement.
  • Figure 2: Mean pass rates (the proportion of time witnesses were judged to be human) by witness and adjudicator types. AI adjudicators (GPT-3.5 and GPT-4) judged GPT-4 witnesses to be human more often than they did real human witnesses. For displaced human adjudicators this was only true for the best GPT-4 witness. ELIZA's pass rate was low across all adjudicators.
  • Figure 3: Transcript length in words had no significant effect on the accuracy of judgements for interactive human and AI adjudicators. For displaced adjudicators, longer transcripts correlated with lower accuracy.
  • Figure 4: The top 10 classes of reasons provided by different adjudicator types (GPT-3.5, GPT-4, and Displaced Human) for each verdict (AI and Human). Reasoning was strikingly similar across adjudicator types.
  • Figure 5: Mean and 95% CI for statistical AI detection metrics. Red dashed lines represent optimal discrimination thresholds. The majority of AI witnesses show the general trend that AI-generated content tends to have a higher likelihood ($t = -5.23, p < 0.001$). However, the best-performing GPT-4 prompt shows a similar mean likelihood to human witnesses. Curvature shows a more reliable difference between humans and all kinds of AI ($t = -8.84, p < 0.001$); however, high variability within each witness type led to relatively low discriminative accuracy (69%).
  • ...and 5 more figures
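
The Figure 5 caption refers to an optimal discrimination threshold on a scalar metric (likelihood or curvature) and the accuracy obtained at that threshold. The following is a minimal sketch of how such a threshold could be found by sweeping candidate cut-points. It is not the authors' procedure: the function name `best_threshold`, the `higher_is_ai` flag, and the toy scores are all assumptions made for illustration, and the scores are invented rather than taken from the paper.

```python
from typing import Sequence, Tuple


def best_threshold(
    human_scores: Sequence[float],
    ai_scores: Sequence[float],
    higher_is_ai: bool = True,  # assumption: larger metric values indicate AI
) -> Tuple[float, float]:
    """Return (threshold, accuracy) for the single cut-point that best
    separates human from AI scores on one scalar metric."""
    labeled = [(s, False) for s in human_scores] + [(s, True) for s in ai_scores]
    values = sorted({s for s, _ in labeled})

    # Candidate thresholds: one below the minimum, one above the maximum,
    # and the midpoint between each pair of adjacent distinct values.
    candidates = [values[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(values, values[1:])]
    candidates.append(values[-1] + 1.0)

    best_t, best_acc = candidates[0], 0.0
    for t in candidates:
        correct = 0
        for score, is_ai in labeled:
            predicted_ai = (score > t) if higher_is_ai else (score < t)
            correct += predicted_ai == is_ai
        acc = correct / len(labeled)
        if acc > best_acc:  # keep the first threshold reaching the maximum
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy, made-up scores purely to show the call shape.
human = [0.1, 0.2, 0.3, 0.6]
ai = [0.4, 0.7, 0.8, 0.9]
threshold, accuracy = best_threshold(human, ai)
print(f"threshold={threshold:.2f}, accuracy={accuracy:.3f}")
```

When the two score distributions overlap heavily, as the caption describes for curvature, even the best single threshold misclassifies many transcripts, which is why a significant mean difference can coexist with modest accuracy. A threshold chosen and scored on the same data is also optimistic; a held-out split would give a fairer estimate.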