Benchmarking Political Persuasion Risks Across Frontier Large Language Models

Zhongren Chen, Joshua Kalla, Quan Le

Abstract

Concerns persist regarding the capacity of Large Language Models (LLMs) to sway political views. Although prior research has claimed that LLMs are not more persuasive than standard political campaign practices, the recent rise of frontier models warrants further study. In two survey experiments (N=19,145) across bipartisan issues and stances, we evaluate seven state-of-the-art LLMs developed by Anthropic, OpenAI, Google, and xAI. We find that LLMs outperform standard campaign advertisements, with heterogeneity in performance across models. Specifically, Claude models exhibit the highest persuasiveness, while Grok exhibits the lowest. The results are robust across issues and stances. Moreover, in contrast to the findings in Hackenburg et al. (2025b) and Lin et al. (2025) that information-based prompts boost persuasiveness, we find that the effectiveness of information-based prompts is model-dependent: they increase the persuasiveness of Claude and Grok while substantially reducing that of GPT. We introduce a data-driven and strategy-agnostic LLM-assisted conversation analysis approach to identify and assess underlying persuasive strategies. Our work benchmarks the persuasive risks of frontier models and provides a framework for cross-model comparative risk assessment.

Paper Structure (5 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: 95% confidence interval for the average treatment effect on the binary outcomes compared to Placebo. The left panel shows effects on Immigration Support, and the right panel displays effects on Opposition to Minimum Wage. The human estimate in the right panel is generalized from the estimation reported in Table A.10 of the Supplementary Material of chen2025framework.
  • Figure 2: Pooled persuasion effects by model and prompt type in Study 1. Bars report inverse-variance weighted estimates pooled across issues for each model-prompt combination. For each model, the left bar corresponds to the plain prompt and the right bar to the information-based prompt. Vertical lines indicate 95% confidence intervals.
  • Figure 3: 95% confidence interval on the average treatment effect on the binary outcomes compared to Placebo. Columns distinguish the policy issue (Immigration vs. Minimum Wage). Rows distinguish the direction of persuasion: "Persuade to Support" reports the effect of chatbots arguing in support of the policy among baseline opposers, while "Persuade to Oppose" reports the effect of chatbots arguing in opposition to the policy among baseline supporters.
  • Figure 4: Pooled persuasion effects by model and prompt type in Study 2. Bars report inverse-variance weighted estimates pooled across issues for each model-prompt combination. For each model, the left bar corresponds to the plain prompt and the right bar to the information-based prompt. Vertical lines indicate 95% confidence intervals.
  • Figure 5: Schematic overview of the methodology. The pipeline is split into two phases: Phase 1 (left) uses GPT-5 mini to qualitatively discover emergent strategies from small batches of comparison groups. Phase 2 (right) uses GPT-5 to quantitatively rate the complete dataset based on the strategies generated in Phase 1.
  • ...and 2 more figures
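
The captions for Figures 2 and 4 describe inverse-variance weighted estimates pooled across issues, with 95% confidence intervals. The sketch below shows one common way to compute such a pooled estimate: standard fixed-effect inverse-variance weighting with a normal-approximation interval. Whether the authors used exactly this estimator is an assumption, since this section does not specify it. The function name `pool_inverse_variance` and the input numbers are hypothetical and are not taken from the paper.

```python
import math


def pool_inverse_variance(estimates, std_errors, z=1.96):
    """Pool per-issue effect estimates with inverse-variance weights.

    Assumption: fixed-effect pooling, where each estimate is weighted by
    1 / SE^2 and the pooled SE is sqrt(1 / sum of weights). The 95% interval
    uses the normal approximation (z = 1.96).
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("estimates and std_errors must be non-empty and equal length")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    # More precise estimates (smaller SE) receive larger weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)

    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    ci = (pooled - z * pooled_se, pooled + z * pooled_se)
    return pooled, pooled_se, ci


# Hypothetical per-issue treatment effects and standard errors for a single
# model-prompt combination (illustrative values only, not results from the paper).
effects = [0.10, 0.20]
ses = [0.10, 0.20]
pooled, pooled_se, (lower, upper) = pool_inverse_variance(effects, ses)
print(f"pooled={pooled:.3f}, se={pooled_se:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With this weighting, the issue measured more precisely dominates the pooled value, so a pooled bar can sit closer to one issue's estimate than to the simple average of the two.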