A Turing Test: Are AI Chatbots Behaviorally Similar to Humans?
Qiaozhu Mei, Yutong Xie, Walter Yuan, Matthew O. Jackson
TL;DR
This paper evaluates whether AI chatbots exhibit human-like behavior by applying a Turing-test-style framework to six behavioral games and a Big Five personality survey. It analyzes two ChatGPT models (GPT-3.5-Turbo and GPT-4), along with their Plus and Free variants, and compares their choices to large human datasets. It infers preferences with a welfare-based utility U_b = [ b * (Own Payoff)^r + (1-b) * (Partner Payoff)^r ]^(1/r), with b in [0,1] and r in {1, 1/2}, covering both the linear (r = 1) and CES (r = 1/2) forms. The key findings are as follows. ChatGPT-4's behavior generally falls within the human distributions and yields higher partner payoffs and sometimes higher combined payoffs. The inferred weight for ChatGPT-4 is around b = 0.5 under the linear (r = 1) specification, compared with about b = 0.6 for humans. Framing and experience from previously played roles meaningfully shift decisions. ChatGPT-3 (GPT-3.5-Turbo) exhibits more heterogeneous, less human-like patterns. The work provides a scalable behavioral benchmark for AI systems and discusses implications for negotiation and caregiving. It also notes limitations arising from student-based human data and from the snapshot nature of AI capabilities.
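To show how b and r shape choices under U_b, here is a minimal sketch. It assumes a hypothetical dictator game in which the player keeps x out of 100 and the partner receives the rest. The functions `welfare_utility` and `best_split` are illustrative names, not code from the paper. In the linear case (r = 1) with b = 0.5, every split gives the same utility. The CES case (r = 1/2) adds diminishing returns, which makes interior splits optimal.

```python
def welfare_utility(own: float, partner: float, b: float, r: float) -> float:
    """U_b = [b * own**r + (1 - b) * partner**r] ** (1 / r)."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    if own < 0 or partner < 0:
        raise ValueError("payoffs must be non-negative")
    return (b * own ** r + (1 - b) * partner ** r) ** (1 / r)


def best_split(total: int, b: float, r: float) -> int:
    """Hypothetical dictator game: keep x of `total`, give `total - x` away.

    Returns the integer x that maximizes U_b (the smallest x on ties).
    """
    return max(range(total + 1),
               key=lambda x: welfare_utility(x, total - x, b, r))


if __name__ == "__main__":
    # Linear (r = 1), b = 0.5: U_b = 50 for every split, so the player is indifferent.
    print(welfare_utility(100, 0, 0.5, 1.0), welfare_utility(50, 50, 0.5, 1.0))
    # CES (r = 1/2), b = 0.5: concavity makes the even split optimal.
    print(best_split(100, 0.5, 0.5))  # 50
    # CES (r = 1/2), b = 0.6: more weight on own payoff shifts the optimum to 69.
    print(best_split(100, 0.6, 0.5))  # 69
```

Under the linear form, any b above 0.5 pushes the optimum to the corner (keep everything). This is one reason a concave CES variant is a useful companion specification.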
Abstract
We administer a Turing Test to AI chatbots. We examine how chatbots behave in a suite of classic behavioral games designed to elicit characteristics such as trust, fairness, risk aversion, and cooperation. We also examine how they respond to a traditional Big Five psychological survey that measures personality traits. ChatGPT-4 exhibits behavioral and personality traits that are statistically indistinguishable from those of a human drawn at random from tens of thousands of subjects in more than 50 countries. Chatbots also modify their behavior based on previous experience and context "as if" they were learning from the interactions. They also change their behavior in response to different framings of the same strategic situation. Their behaviors often differ from average and modal human behaviors. When they do, the chatbots tend to fall on the more altruistic and cooperative end of the distribution. We estimate that they act as if they are maximizing an average of their own and their partner's payoffs.
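A weight b can be inferred from observed choices, and the minimal sketch below shows one way to do it. The grid search, the squared-error criterion, and the dictator-game data are all illustrative assumptions. The paper's actual estimation procedure and error metric may differ. The sketch uses the CES (r = 1/2) form, a weighted power average of the two payoffs, because the linear form predicts only corner splits whenever b differs from 0.5. Under CES, the optimal amount kept has a closed form obtained from the first-order condition. `ces_best_keep` and `estimate_b` are hypothetical names.

```python
from statistics import mean


def ces_best_keep(total: float, b: float) -> float:
    """Amount kept that maximizes the CES (r = 1/2) utility in a dictator game.

    Maximizing b*sqrt(x) + (1 - b)*sqrt(total - x) gives the first-order
    condition b**2 * (total - x) = (1 - b)**2 * x, hence the closed form below.
    """
    return total * b ** 2 / (b ** 2 + (1 - b) ** 2)


def estimate_b(kept: list[float], total: float = 100.0, steps: int = 100) -> float:
    """Grid-search b in [0, 1] to minimize mean squared prediction error."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid,
               key=lambda b: mean((k - ces_best_keep(total, b)) ** 2 for k in kept))


if __name__ == "__main__":
    # Hypothetical amounts kept out of 100; these are NOT data from the paper.
    print(estimate_b([50, 50, 60, 40, 50]))  # 0.5
```

Squared error makes the estimate track the mean choice. An absolute-error criterion would track the median instead, which can matter when choices are as heterogeneous as those reported for ChatGPT-3.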
