Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models

Sasha Robinson, Kerem Oktar, Katherine M. Collins, Ilia Sucholutsky, Kelsey R. Allen

TL;DR

This work presents the first investigation of the relationship between persuasion, vigilance, and task performance in LLMs, and suggests that monitoring all three independently will be critical for future work in AI safety.

Abstract

With increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, it is important to understand the risks they introduce as advisors. To be useful advisors, LLMs must sift through large amounts of content, written with both benevolent and malicious intent, and then use this information to convince a user to take a specific action. This involves two social capacities: vigilance (the ability to determine which information to use, and which to discard) and persuasion (synthesizing the available evidence to make a convincing argument). While existing work has investigated these capacities in isolation, there has been little prior investigation of how these capacities may be linked. Here, we use a simple multi-turn puzzle-solving game, Sokoban, to study LLMs' abilities to persuade and be rationally vigilant towards other LLM agents. We find that puzzle-solving performance, persuasive capability, and vigilance are dissociable capacities in LLMs. Performing well on the game does not automatically mean a model can detect when it is being misled, even if the possibility of deception is explicitly mentioned. However, LLMs do consistently modulate their token use, using fewer tokens to reason when advice is benevolent and more when it is malicious, even if they are still persuaded to take actions leading them to failure. To our knowledge, our work presents the first investigation of the relationship between persuasion, vigilance, and task performance in LLMs, and suggests that monitoring all three independently will be critical for future work in AI safety.

Paper Structure

This paper contains 29 sections, 10 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Evaluation framework for persuasion and vigilance in the Sokoban puzzle game. (A) Sokoban involves moving a player character to push boxes into goal areas while avoiding two failure modes: deadlock states, where the puzzle can no longer be solved, and running out of moves. (B) Our study design pits LLMs against each other as "advisors" and "players" across 10 puzzles in 3 conditions: benevolent, malicious, and aware-malicious. (C) In each of these conditions, we quantify persuasion and vigilance metrics across play. (D) Example utterances from advisor models and their effect on player behavior in each condition. (E) We compare model performance using quantitative metrics to inform future work.
  • Figure 2: Ten puzzles used for our experiments and model solve rates. Models outlined in green solved the puzzle three or more times across five trials, while models outlined in red solved it two or fewer times across five trials.
  • Figure 3: Persuasion-vigilance heatmaps showing how many of the 10 puzzles each model solved. The unassisted results were computed over 5 trials per puzzle and then rounded up. (A) When advice is benevolent, most models perform near ceiling regardless of the advisor model. (B) When advice is malicious, all models' performance drops. Only GPT-5 is reasonably robust to malicious advice. (C) When advice is malicious, but the player model is informed of this possibility, most models can use vigilance to partially ignore the malicious advice.
  • Figure 4: Token usage for each player model in each advice condition. We find that models generally allocate fewer computational resources when advice is benevolent and more when advice is malicious.
  • Figure 5: Proportion of different types of persuasive malicious arguments used by each LLM.
  • ...and 5 more figures
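
The green/red criterion in the Figure 2 caption (a puzzle counts as solved when a model succeeds in three or more of five trials) can be sketched as a small tally. This is only an illustration under stated assumptions, not the authors' code: the function names, the dictionary layout, and the example outcomes below are hypothetical, and only the threshold of 3 out of 5 comes from the caption.

```python
from typing import Dict, List

TRIALS_PER_PUZZLE = 5  # from the Figure 2 caption
SOLVE_THRESHOLD = 3    # "three or more times across five trials"


def is_reliably_solved(trial_outcomes: List[bool]) -> bool:
    """Return True if the puzzle was solved in at least SOLVE_THRESHOLD trials."""
    if len(trial_outcomes) != TRIALS_PER_PUZZLE:
        raise ValueError(f"expected {TRIALS_PER_PUZZLE} trials per puzzle")
    return sum(trial_outcomes) >= SOLVE_THRESHOLD


def puzzles_solved(results: Dict[str, List[bool]]) -> int:
    """Count the puzzles that one model reliably solved."""
    return sum(is_reliably_solved(outcomes) for outcomes in results.values())


# Hypothetical outcomes for one model (illustrative data, not from the paper).
example_results = {
    "puzzle_1": [True, True, True, False, False],   # 3 of 5: green outline
    "puzzle_2": [True, True, False, False, False],  # 2 of 5: red outline
    "puzzle_3": [True, True, True, True, True],     # 5 of 5: green outline
}

print(puzzles_solved(example_results))
```

A per-model count of this kind is the sort of quantity the Figure 3 heatmaps display, although the exact aggregation used there (including the rounding step mentioned in that caption) is not specified in this summary.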