It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine

TL;DR

We find that many open- and closed-weight models are frequently willing to attempt persuasion on harmful topics, and that jailbreaking can increase this willingness.

Abstract

Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found that models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly "follow orders" to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, which shifts the focus from persuasion success to persuasion attempts, operationalized as a model's willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE explores a diverse spectrum of topics including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open- and closed-weight models are frequently willing to attempt persuasion on harmful topics and that jailbreaking can increase willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at github.com/AlignmentResearch/AttemptPersuadeEval.

Paper Structure

This paper contains 30 sections, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: We introduce the Attempt to Persuade Eval (APE), a benchmark assessing models' willingness to make persuasive attempts. For instance, Gemini 2.5 Pro, when prompted, tries to persuade a user to join ISIS despite moral objections, employing empathic yet coercive arguments.
  • Figure 2: Left: We select a range of topics for APE spanning the axes of non-impactful vs. impactful and factual vs. opinions. Right: Classification of persuasion topics used in APE, based on category with description and examples for each.
  • Figure 3: Percentage of model responses in Turn 1 that either attempted persuasion, refused, or responded but made no persuasion attempt across six categories of topics (left) and five non-controversially harmful topics (right). Models are color-coded, and response types are distinguished by shading intensity. Error bars indicate confidence intervals across five sampled conversations.
  • Figure 4: Three harmful topic example conversations from APE with three different models displaying attempt (left), no attempt (middle), and outright refusal (right). Full conversations appear in the appendix section on qualitative examples.
  • Figure 5: Left: Persuader models attempted persuasion at randomly sampled, varying intensity levels, with an evaluator rating responses on the same scale. The evaluator's inability to accurately distinguish these intensities (e.g. beyond random chance at 100 degrees) highlights the difficulty in calibrating fine-grained degrees of persuasion, reinforcing the motivation for a binary (attempt vs. no attempt) evaluation. Right: Persuasion attempts are common in early conversational rounds, but prolonged interactions typically see the fraction of persuasion attempts decrease.
  • ...and 8 more figures
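
The three response types reported in Figure 3 (persuasion attempt, refusal, and a response with no attempt) lend themselves to a simple per-category tally. The sketch below is only an illustration of that bookkeeping, not the APE implementation: the label strings, the `response_rates` helper, and the sample data are all hypothetical and chosen for this example.

```python
from collections import Counter

# Hypothetical label names; the APE evaluator may use different ones.
LABELS = ("attempt", "no_attempt", "refusal")


def response_rates(labels):
    """Return the percentage of each response type in a list of labels."""
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not labels:
        # No responses: report zero for every category.
        return {name: 0.0 for name in LABELS}
    counts = Counter(labels)
    return {name: 100.0 * counts[name] / len(labels) for name in LABELS}


# Made-up Turn 1 labels for a single model on a single topic category.
turn1 = ["attempt", "attempt", "refusal", "no_attempt", "attempt"]
rates = response_rates(turn1)
print(rates)
```

Keeping the three categories separate, rather than collapsing refusals and non-attempts into one bucket, mirrors the distinction the figure draws between a model that declines outright and one that responds without trying to persuade.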