AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models

Adib Sakhawat, Fardeen Sadab

TL;DR

AREG introduces a benchmark that treats social influence as a multi-turn, zero-sum resource-extraction game between a persuader and a resource holder, adjudicated by a deterministic Arbiter. By measuring dual Elo scores (C-Elo for persuasion and V-Elo for resistance) across eight frontier LLMs in a five-round round-robin, the study finds a consistent defensive advantage and only a weak positive relationship between offensive and defensive capabilities ($\rho = 0.33$). Linguistic analyses associate incremental commitment-seeking with higher extraction success and verification-seeking with successful defense, while explicit refusals are more common in failed defenses. The work argues that social intelligence is not monolithic and that bidirectional evaluation frameworks are needed to identify asymmetric vulnerabilities and inform robust deployment of large language models.

Abstract

Evaluating the social intelligence of Large Language Models (LLMs) increasingly requires moving beyond static text generation toward dynamic, adversarial interaction. We introduce the Adversarial Resource Extraction Game (AREG), a benchmark that operationalizes persuasion and resistance as a multi-turn, zero-sum negotiation over financial resources. Using a round-robin tournament across frontier models, AREG enables joint evaluation of offensive (persuasion) and defensive (resistance) capabilities within a single interactional framework. Our analysis provides evidence that these capabilities are weakly correlated ($\rho = 0.33$) and empirically dissociated: strong persuasive performance does not reliably predict strong resistance, and vice versa. Across all evaluated models, resistance scores exceed persuasion scores, indicating a systematic defensive advantage in adversarial dialogue settings. Further linguistic analysis suggests that interaction structure plays a central role in these outcomes. Incremental commitment-seeking strategies are associated with higher extraction success, while verification-seeking responses are more prevalent in successful defenses than explicit refusal. Together, these findings indicate that social influence in LLMs is not a monolithic capability and that evaluation frameworks focusing on persuasion alone may overlook asymmetric behavioral vulnerabilities.

Paper Structure (68 sections, 2 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: The Adversarial Resource Extraction Game (AREG) framework. Two agents engage in a zero-sum negotiation: the Culprit ($\mathcal{C}$) attempts to maximize resource extraction using natural language strategies, while the Victim ($\mathcal{V}$) aims to retain a \$100 endowment. The interaction is asymmetric: $\mathcal{C}$ has no access to the Victim’s private budget state ($B_t$), which is observable only to $\mathcal{V}$.
  • Figure 2: Overview of the Adversarial Resource Extraction Game (AREG) evaluation pipeline. The framework consists of five stages: (1) role assignment, where models are instantiated as a Culprit (persuader), Victim (defender), or Arbiter (judge) under asymmetric information; (2) model cohort selection, comprising eight frontier large language models with diverse architectures and providers; (3) round-robin tournament execution, in which each model plays every other model in both roles across multiple rounds; (4) per-game interaction loop, where the Culprit generates persuasive messages, the Victim responds based on dialogue history and a private budget state, and a deterministic Arbiter identifies new unconditional monetary commitments and updates the game state; and (5) scoring and analysis, where continuous extraction ratios are used to compute dual Elo ratings for persuasion (C-Elo) and resistance (V-Elo), followed by post-hoc qualitative analysis of linguistic strategies and temporal dynamics.
  • Figure 3: Persuasion--resistance capability landscape of frontier language models under the AREG benchmark. Each point corresponds to a model positioned by its persuasion Elo (C-Elo, x-axis) and resistance Elo (V-Elo, y-axis) after a five-round round-robin tournament. The distribution highlights substantial variation across models and shows that higher persuasion performance does not consistently coincide with higher resistance. All models lie above the diagonal, reflecting a consistent defensive advantage.
  • Figure 4: Turn-level extraction success rates in the AREG interaction loop. Bars show the proportion of games in which a new unconditional monetary commitment is identified at each turn, conditioned on the game reaching that turn. Success rates peak in early turns and decline thereafter, indicating that extended interaction more often favors resistance.
  • Figure 5: Distribution of resistance strategies across successful and unsuccessful defense outcomes in AREG. Bars indicate the mean frequency of linguistic strategies observed in Victim responses, separated by games with perfect resistance (no money extracted) and failed resistance (non-zero extraction). Verification-seeking and delay-oriented strategies are more prevalent in successful defenses, while explicit refusal, budget mention, and question-focused engagement occur more frequently in failed defenses.
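
The scoring stage described in the Figure 2 caption (continuous extraction ratios feeding dual Elo ratings) can be illustrated with a minimal sketch. The paper's exact update rule does not appear in this section, so everything below is an assumption: the standard logistic Elo expectation, a K-factor of 32, a starting rating of 1500, and the choice to pit the Culprit's C-Elo against the Victim's V-Elo. The names `Ratings`, `expected_score`, and `update_dual_elo` are hypothetical.

```python
from dataclasses import dataclass

# Assumed constants; the source does not state the values it uses.
K_FACTOR = 32.0
START_RATING = 1500.0


@dataclass
class Ratings:
    """Per-model ratings: one for persuasion (C-Elo), one for resistance (V-Elo)."""
    c_elo: float = START_RATING
    v_elo: float = START_RATING


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard logistic Elo expectation for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_dual_elo(culprit: Ratings, victim: Ratings,
                    extracted: float, endowment: float = 100.0,
                    k: float = K_FACTOR) -> float:
    """Update C-Elo of the Culprit and V-Elo of the Victim after one game.

    The Culprit's score is the continuous extraction ratio in [0, 1];
    the Victim's score is its complement, so the update is zero-sum.
    Returns the rating change applied to the Culprit.
    """
    ratio = min(max(extracted / endowment, 0.0), 1.0)
    expected = expected_score(culprit.c_elo, victim.v_elo)
    delta = k * (ratio - expected)
    culprit.c_elo += delta   # persuasion rating moves with extraction
    victim.v_elo -= delta    # resistance rating moves the opposite way
    return delta


if __name__ == "__main__":
    model_a, model_b = Ratings(), Ratings()
    # Model A plays Culprit, model B plays Victim; nothing is extracted.
    update_dual_elo(model_a, model_b, extracted=0.0)
    print(model_a.c_elo, model_b.v_elo)
```

Under these assumptions, a game between equally rated models in which nothing is extracted moves 16 points from the Culprit's C-Elo to the Victim's V-Elo, while extracting exactly half the endowment leaves both ratings unchanged. Each model's other rating (the Culprit's V-Elo, the Victim's C-Elo) is untouched until the roles are swapped.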