Evidence of a log scaling law for political persuasion with large language models

Kobi Hackenburg, Ben M. Tappin, Paul Röttger, Scott Hale, Jonathan Bright, Helen Margetts

TL;DR

The paper investigates whether static political messages generated by large language models become more persuasive as model size increases. It analyzes 720 messages on 10 U.S. political issues produced by 24 language models, including two frontier closed models (GPT-4 and Claude-3-Opus), and delivered to 25,982 U.S. adults in a preregistered randomized survey experiment, with model size measured by active parameters. A random-effects meta-analysis supports a log scaling law: persuasiveness increases with $\log(\text{parameter count})$, such that a one-unit increase in $\log(\text{parameters})$ raises the average treatment effect by $1.26$ percentage points, with an intercept of $5.77$ percentage points at average size. Returns therefore diminish sharply, and frontier models are only modestly more persuasive than much smaller models. Importantly, after adjusting for task completion (coherence and staying on topic), model size no longer predicts persuasiveness, suggesting a ceiling on gains from scaling for static messages. For policy-relevant risk assessment, these findings indicate that the persuasiveness of static messages may plateau in the near term and that further gains may depend more on task completion than on sheer size, though multi-turn or fine-tuned approaches could still have greater impact.
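
As a rough illustration of the log-linear relationship summarized above, the sketch below plugs the two reported coefficients ($1.26$ and $5.77$ percentage points) into a simple function. It is not the authors' meta-analytic code. The function names, the `mean_log_params` centering value, and the choice of natural logarithm as the default base are all assumptions made for illustration; the summary does not state the base or the average model size.

```python
import math

# Coefficients quoted in the TL;DR (percentage points).
SLOPE_PP = 1.26      # change in treatment effect per one-unit increase in log(parameters)
INTERCEPT_PP = 5.77  # predicted treatment effect at average model size


def predicted_effect(params, mean_log_params, base=math.e):
    """Predicted treatment effect (pp) for a model with `params` parameters.

    ASSUMPTIONS: `mean_log_params` (the centering value) and `base`
    (the logarithm base) are hypothetical inputs, not values from the paper.
    """
    return INTERCEPT_PP + SLOPE_PP * (math.log(params, base) - mean_log_params)


def gain_from_scaling(params_small, params_large, base=math.e):
    """Predicted gain (pp) from scaling a model from `params_small` to `params_large`.

    Under a log-linear law the gain depends only on the size ratio,
    so the centering value cancels out.
    """
    return SLOPE_PP * math.log(params_large / params_small, base)


if __name__ == "__main__":
    # Each tenfold increase adds the same predicted increment,
    # whether it starts from 1B or from 30B parameters.
    print(gain_from_scaling(1e9, 1e10))
    print(gain_from_scaling(3e10, 3e11))
```

Because the predicted gain depends only on the ratio of model sizes, each additional parameter contributes less as models grow, which is the diminishing-returns pattern the summary describes.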

Abstract

Large language models can now generate political messages as persuasive as those written by humans, raising concerns about how far this persuasiveness may continue to increase with model size. Here, we generate 720 persuasive messages on 10 U.S. political issues from 24 language models spanning several orders of magnitude in size. We then deploy these messages in a large-scale randomized survey experiment (N = 25,982) to estimate the persuasive capability of each model. Our findings are twofold. First, we find evidence of a log scaling law: model persuasiveness is characterized by sharply diminishing returns, such that current frontier models are barely more persuasive than models smaller in size by an order of magnitude or more. Second, mere task completion (coherence, staying on topic) appears to account for larger models' persuasive advantage. These findings suggest that further scaling model size will not much increase the persuasiveness of static LLM-generated messages.

Paper Structure (17 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Language model persuasiveness scales logarithmically with model size. Panel A is plotted on a logarithmic x-axis; Panel B is plotted on a linear x-axis. The displayed point estimates (language model and human) are the raw treatment effect estimates with 95% CIs. The slope/curve is the meta-analytic estimated treatment effect for models with different numbers of parameters. For the frontier language models whose true size is unknown (GPT-4 and Claude-3-Opus), size was assumed to be a conservative lower bound of 300B parameters. The results are robust to assumed values up to and beyond 1T for these models; see Supplementary Information Figure S4 for the sensitivity analysis. Note that some model labels have been removed from the figure for clarity, and plotted estimates for frontier models are horizontally jittered.
  • Figure 2: Contrast tests directly comparing the estimated persuasive impact of each model and our human benchmark to Claude-3-Opus. We use Claude-3-Opus as the reference model because it had the higher estimated mean persuasive impact of the two frontier models in our sample. Several models that are orders of magnitude smaller than Claude-3-Opus and GPT-4 nonetheless exhibited similar persuasive capabilities. None of the models was significantly more persuasive than our human benchmark.
  • Figure 3: Investigating why larger models are more persuasive. (A) Linear association between each (Z-scored) message/model feature and persuasiveness. Task completion is the only feature which is a statistically significant predictor of persuasiveness. (B) Task completion score is non-linearly associated with language model persuasiveness. (C) Task completion score is non-linearly associated with model size. (D) Adjusting for task completion score renders model size a non-significant predictor of persuasion. Note: some model labels in panels (B) and (C) have been removed for clarity.
  • Figure 4: Estimated association between persuasive impact and model size, disaggregated by issue. The red dashed line indicates the average association across issues (identical to Figure 1); the shaded region is the 95% prediction interval across issues; and the issue-level lines are the raw association for each issue.