Limits of Large Language Models in Debating Humans

James Flamino, Mohammed Shahid Modi, Boleslaw K. Szymanski, Brendan Cross, Colton Mikolajczyk

TL;DR

Agents can blend in and concentrate on a debate's topic better than humans, improving the productivity of all players, yet several behavioral metrics collected from humans and agents deviate measurably from each other.

Abstract

Large Language Models (LLMs) have shown remarkable promise in communicating with humans. Their potential use as artificial partners with humans in sociological experiments involving conversation is an exciting prospect. But how viable is it? Here, we rigorously test the limits of agents that debate using LLMs in a preregistered study that runs multiple debate-based opinion consensus games. Each game starts with six humans, six agents, or three humans and three agents. We found that agents can blend in and concentrate on a debate's topic better than humans, improving the productivity of all players. Yet, humans perceive agents as less convincing and confident than other humans, and several behavioral metrics of humans and agents we collected deviate measurably from each other. We observed that agents are already decent debaters, but their behavior generates a pattern distinctly different from the human-generated data.

Paper Structure

This paper contains 21 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Experimental setup for an AH game type. (A) Stage 1: The players select an initial opinion and rate their confidence. (B) Stage 2: The players are placed in a fully connected network with five other players who completed stage 1. The blue icons represent human players in this network, and the red icons represent agents. The players can then invite each other to engage in one-on-one conversations. (C) After two players agree to converse, they are cut from the network and converse using a text-based messaging system. (D) Once the conversation terminates, both players re-evaluate their opinions and personal confidence and assign a level of perceived confidence to their conversational partner. Then they rejoin the network (see B). If the time limit has expired, all active players move to stage 3. (E) Stage 3: The game ends, human players file an exit survey, and all players quit the study.
  • Figure 2: Bayesian multilevel model for opinion changing by player type. This plot shows the frequency of opinion changes as a function of the two conversing players. Each player in a conversation is a human or an agent, and combining these two types gives us an opinion-switching frequency for all conversation types in the study.
  • Figure 3: Distributions and modeled assignment-type contrasts for perceived and personal confidence. (A) Violin plots of perceived confidence, clustered by assignment type and game type (indicated in parentheses). (B) Violin plots of personal confidence, clustered by player type and game type (also indicated in parentheses). (C) The embedded table shows the contrasts between assignment types in the Bayesian regression model of perceived confidence. Estimates are given as odds ratios between contrasting conversation types. The highest posterior density (HPD) interval is the shortest interval containing 95% of the posterior mass for a given estimate.
  • Figure 4: Post-game and in-game productivity results. (A) Kernel density estimates (KDEs) of the number of conversations per player, grouped by the type of game they played in. (B) KDEs of the number of messages each player sent within each game, grouped by player type. (C) KDEs of the reward point distributions gained by all players at the end of each game, grouped by player type. (D) KDEs of the on-topic keyword frequency of each player, agent or human, across all game types. (E) KDEs of the on-topic keyword frequency of each human in AH games versus HH games.
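
The Figure 3 caption defines the HPD interval as the shortest interval containing 95% of the posterior mass. As a rough illustration of that definition only, the sketch below estimates such an interval from a list of posterior draws. It is not taken from the paper's analysis code: the function name `hpd_interval` is hypothetical, and the approach assumes a unimodal posterior, so that a single contiguous interval is meaningful.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Estimate the shortest interval covering `mass` of the given draws.

    Hypothetical helper for illustration; assumes a unimodal posterior.
    """
    xs = sorted(float(s) for s in samples)
    n = len(xs)
    if n == 0:
        raise ValueError("samples must not be empty")
    if not 0 < mass <= 1:
        raise ValueError("mass must be in (0, 1]")

    # Number of sorted draws the interval has to span.
    k = min(n, max(1, math.ceil(mass * n)))

    # Slide a window of k consecutive sorted draws and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


# Usage: with a skewed set of draws, the interval hugs the dense region.
draws = [0.0, 10.0, 1.0, 1.5]
low, high = hpd_interval(draws, mass=0.5)
print(low, high)
```

Unlike an equal-tailed interval, which cuts the same probability from each side, this window search lets the interval shift toward wherever the draws are most concentrated, which matters for skewed quantities such as odds ratios.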