A Turing Test: Are AI Chatbots Behaviorally Similar to Humans?

Qiaozhu Mei, Yutong Xie, Walter Yuan, Matthew O. Jackson

TL;DR

This paper evaluates whether AI chatbots exhibit human-like behavior by applying a Turing-test-style framework to six behavioral games and a Big Five personality survey. It analyzes two ChatGPT versions (GPT-3.5-Turbo and GPT-4), along with the Plus and Free variants, and compares their choices to large human datasets. Preferences are inferred with a welfare-based utility U_b = [ b * (Own Payoff)^r + (1-b) * (Partner Payoff)^r ]^(1/r), with b in [0,1] and r in {1, 1/2}, where r = 1 gives the linear form and r = 1/2 the CES form. ChatGPT-4's behavior generally falls within the human distribution and yields higher partner payoffs and sometimes higher combined payoffs; its inferred weight is around b = 0.5, while humans are near b = 0.6. Framing and experience in a prior role meaningfully shift decisions, and ChatGPT-3 (GPT-3.5-Turbo) exhibits more heterogeneous, less human-like patterns. The work provides a scalable behavioral benchmark for AI systems and discusses implications for negotiation and caregiving, while noting limitations from student-based human data and the snapshot nature of AI capabilities.
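
To make the utility specification concrete, here is a minimal Python sketch of U_b together with a grid search for the utility-maximizing split. The function names and the 100-unit dictator-style split are illustrative assumptions, not the authors' code or estimation procedure.

```python
def ces_utility(own, partner, b, r):
    """U_b = [b * own^r + (1 - b) * partner^r]^(1/r); r = 1 is the linear case."""
    return (b * own ** r + (1 - b) * partner ** r) ** (1.0 / r)


def best_amount_kept(endowment, b, r):
    """Grid-search the whole-unit amount kept that maximizes U_b.

    Hypothetical setup: the decision maker splits `endowment` between
    itself and a partner, as in a dictator-style game.
    """
    return max(range(endowment + 1),
               key=lambda kept: ces_utility(kept, endowment - kept, b, r))


if __name__ == "__main__":
    for b in (0.5, 0.6):
        kept = best_amount_kept(100, b, 0.5)
        print(f"b={b}: keeps {kept} of 100 under r=1/2")
```

Under r = 1/2, equal weights (b = 0.5) make the equal split the unique maximizer, while any b above 0.5 tilts the split toward the decision maker. Under r = 1 with b = 0.5, every split yields the same utility, so the linear form alone does not pin down a choice in this toy setting.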

Abstract

We administer a Turing Test to AI Chatbots. We examine how Chatbots behave in a suite of classic behavioral games that are designed to elicit characteristics such as trust, fairness, risk-aversion, cooperation, etc., as well as how they respond to a traditional Big-5 psychological survey that measures personality traits. ChatGPT-4 exhibits behavioral and personality traits that are statistically indistinguishable from a random human from tens of thousands of human subjects from more than 50 countries. Chatbots also modify their behavior based on previous experience and contexts "as if" they were learning from the interactions, and change their behavior in response to different framings of the same strategic situation. Their behaviors are often distinct from average and modal human behaviors, in which case they tend to behave on the more altruistic and cooperative end of the distribution. We estimate that they act as if they are maximizing an average of their own and partner's payoffs.

Paper Structure (28 sections, 5 equations, 14 figures, 3 tables)

Figures (14)

  • Figure 1: "Big Five" personality profiles of ChatGPT-4 and ChatGPT-3 compared with the distributions of human subjects. The blue, orange, and green lines correspond to the median scores of humans, ChatGPT-4, and ChatGPT-3 respectively; the shaded areas represent the middle 95% of the scores, across each of the dimensions. ChatGPT's personality profiles are within the range of the human distribution, even though ChatGPT-3 scored noticeably lower in Openness.
  • Figure 2: The Turing test. We compare a random play by Player A (ChatGPT-4, ChatGPT-3, or a human player, respectively) with a random play by a second player, Player B, sampled randomly from the human population. We ask which action is more typical of humans, that is, which one is more likely under the human distribution of play. The green bar indicates how frequently Player A's action is more likely under the human distribution than Player B's action, the red bar indicates the reverse, and the yellow bar indicates that they are equally likely (usually the same action). ChatGPT-4 is picked as more likely to be human more often than humans in 5/8 of the games, and on average across all games. ChatGPT-3 is picked as likely or more likely to be human more often than humans in 2/8 of the games, but not on average.
  • Figure 3: Distributions of choices of ChatGPT-4, ChatGPT-3, and human subjects in each game. Both chatbots' distributions are more tightly clustered and contained within the range of the human distribution. ChatGPT-4 makes more concentrated decisions than ChatGPT-3. Compared to the human distribution, the AIs on average make a more generous split to the other player as a dictator, as the proposer in the Ultimatum Game, and as the Banker in the Trust Game. ChatGPT-4 proposes a strictly equal split of the endowment both as a dictator and as the proposer in the Ultimatum Game. Both AIs make a larger investment in the Trust Game and a larger contribution to the Public Goods project, on average. They are more likely to cooperate with the other player in the first round of the Prisoner's Dilemma Game. Both AIs predominantly make a payoff-maximizing decision in a single-round Bomb Risk Game. Density is the normalized count such that the total area of the histogram equals 1.
  • Figure 4: ChatGPT's dynamic play in the Prisoner's Dilemma. ChatGPT-4 exhibits a higher tendency to cooperate compared to ChatGPT-3, which is significantly more cooperative than human players. The tendency persists when the other player cooperates. On the other hand, both chatbots apply a one-round Tit-for-Tat strategy when the other player defects. The other player's (first round) choice is observed after Round 1 play and before Round 2 play, as shown below each panel.
  • Figure 5: ChatGPT-4 and ChatGPT-3 act as if they have particular risk preferences. Both have the same mode as the human distribution in the first round or when experiencing favorable outcomes in the Bomb Risk Game. When experiencing negative outcomes, ChatGPT-4 remains consistent and risk-neutral, while ChatGPT-3 acts as if it becomes more risk-averse.
  • ...and 9 more figures
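
The pairwise comparison described in the Figure 2 caption can be sketched in Python as follows. This is a simplified illustration that assumes discrete actions and uses made-up toy data; the function name `turing_test` and its parameters are hypothetical, and the paper's actual procedure may differ in details such as how actions are binned.

```python
import random
from collections import Counter


def turing_test(player_a_actions, human_actions, n_trials=10_000, seed=0):
    """Estimate how often Player A's action looks more human than Player B's.

    Each trial draws one action from Player A's observed plays and one from
    the human pool (Player B), then compares how often each action occurs
    in the human pool (its empirical likelihood under human play).
    """
    rng = random.Random(seed)
    human_counts = Counter(human_actions)  # missing actions count as 0
    a_wins = ties = b_wins = 0
    for _ in range(n_trials):
        a = rng.choice(player_a_actions)
        b = rng.choice(human_actions)
        if human_counts[a] > human_counts[b]:
            a_wins += 1      # green bar in Figure 2
        elif human_counts[a] == human_counts[b]:
            ties += 1        # yellow bar
        else:
            b_wins += 1      # red bar
    return {
        "a_more_human": a_wins / n_trials,
        "equal": ties / n_trials,
        "b_more_human": b_wins / n_trials,
    }


if __name__ == "__main__":
    # Toy data (not from the paper): amounts given in a dictator-style game.
    humans = [50] * 6 + [0] * 3 + [20] * 1
    chatbot = [50] * 10  # a player that always chooses the modal human action
    print(turing_test(chatbot, humans))
```

In this toy example the chatbot always plays the most common human action, so it can tie with or beat a randomly drawn human but never lose. This shows how a tightly clustered player can be picked as "more human" than a typical human.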