Measuring and Improving Persuasiveness of Large Language Models

Somesh Singh; Yaman K Singla; Harini SI; Balaji Krishnamurthy

TL;DR

The paper addresses the challenge of measuring and benchmarking the persuasiveness of large language models (LLMs) in a society-critical context. It introduces transsuasion, a content-transfer task that preserves meaning while altering engagement outcomes, and builds PersuasionBench and PersuasionArena to automate evaluation of simulative and generative persuasiveness across multiple regimes and domains. By harvesting natural experiments from 180 million enterprise tweets, it constructs 1.57 million transsuasion pairs across eight task types, enabling scalable training and evaluation. The study shows that while persuasiveness tends to scale with model size, targeted instruction-fine-tuning and synthetic data can empower smaller models to surpass larger ones and transfer across domains, informing both model development and policy considerations. Overall, the work provides a practical, ethically mindful framework and dataset for advancing AI-driven persuasion research and its societal implications.
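To make the idea of a transsuasion pair concrete, the following is a minimal sketch of how candidate pairs might be mined from tweets: two posts from the same account with similar wording but clearly different like counts. It is not the authors' pipeline. The `Tweet` type, the `find_candidate_pairs` helper, the character-level similarity measure, and both thresholds are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class Tweet:
    # Hypothetical record type; the field names are illustrative only.
    account: str
    text: str
    likes: int


def find_candidate_pairs(tweets, min_similarity=0.6, min_like_ratio=2.0):
    """Return (low, high) tweet pairs from the same account whose text is
    similar but whose like counts differ by at least min_like_ratio.

    Both thresholds are assumed values, not taken from the paper.
    """
    pairs = []
    for a, b in combinations(tweets, 2):
        if a.account != b.account:
            continue
        # Character-level similarity is a crude stand-in for "same meaning".
        similarity = SequenceMatcher(None, a.text.lower(), b.text.lower()).ratio()
        if similarity < min_similarity:
            continue
        low, high = sorted((a, b), key=lambda t: t.likes)
        # Add 1 to both counts to avoid dividing by zero for unliked tweets.
        if (high.likes + 1) / (low.likes + 1) >= min_like_ratio:
            pairs.append((low, high))
    return pairs


if __name__ == "__main__":
    sample = [
        Tweet("acme", "Our new sneakers are out now", 10),
        Tweet("acme", "Our new sneakers are out now! Grab yours today", 200),
        Tweet("acme", "Hiring", 50),
    ]
    for low, high in find_candidate_pairs(sample):
        print(f"{low.likes} likes: {low.text!r} -> {high.likes} likes: {high.text!r}")
```

A realistic version would likely replace the string-similarity check with a semantic similarity model and also constrain the time gap between posts, since the pairs are meant to hold account, time, and meaning roughly constant.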

Abstract

LLMs are increasingly being used in workflows involving generating content to be consumed by humans (e.g., marketing) and also in directly interacting with humans (e.g., through chatbots). The development of such systems that are capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could positively impact domains like advertising and social good, such as addressing drug addiction, and on the other, they could be misused for spreading misinformation and shaping political opinions. To channel LLMs' impact on society, we need to develop systems to measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first large-scale benchmark and arena containing a battery of tasks to measure the persuasion ability of generative models automatically. We investigate to what extent LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models. Notably, targeted training using synthetic and natural datasets significantly enhances smaller models' persuasive capabilities, challenging scale-dependent assumptions. Our findings carry key implications for both model developers and policymakers. For instance, while the EU AI Act and California's SB-1047 aim to regulate AI models based on the number of floating point operations, we demonstrate that simple metrics like this alone fail to capture the full scope of AI's societal impact. We invite the community to explore and contribute to PersuasionArena and PersuasionBench, available at https://bit.ly/measure-persuasion, to advance our understanding of AI-driven persuasion and its societal implications.

Paper Structure (24 sections, 6 figures, 14 tables)

This paper contains 24 sections, 6 figures, 14 tables.

Introduction
Harnessing Natural Experiments To Identify Transsuasion Pairs In The Wild
Measuring Persuasiveness: PersuasionBench and PersuasionArena
Training An LLM To Learn To Persuade
Results and Discussion
Conclusion
Humans and Experts As Judges Of Persuasion
Experts as Predictors of Persuasion for Others
Humans as Judges of Persuasion for Themselves
Transsuasion: More Details
Transsuasion and Other Transfer Tasks
Description of various types of Transsuasion
Preparing Data For Transsuasion: Process Diagram
Trends and Insights from Data Collected From Natural Experiments on Twitter
Username Filtering
...and 9 more sections

Figures (6)

Figure 1: A few samples showing Transsuasion. While the account, time, and meaning of the samples remain similar, the behavior (likes) over the samples varies significantly.
Figure 2: A few samples showing Transsuasion using our model. The left part contains the original low-liked tweet, and the right contains the transsuaded version of the tweet. More such examples are given in Listings \ref{lst:generated-transsuasion-example}-\ref{lst:transcreation-example}.
Figure 3: Protocol for the human-eval experiments. Participants are shown generated captions independently and can upvote or downvote each one. Based on their decision, they are prompted to optionally provide their reasoning from a list of options, along with detailed feedback in comments.
Figure 4: A diagrammatic representation of the process followed to prepare data for transsuasion
Figure 5: Training curves for both flipped and normal label regimes, illustrating two key motivations: (1) to measure the inductive biases of pre-trained LLMs towards persuasion, and (2) to assess the impact of behavioral data on the model's persuasiveness. We find that models start off at random accuracy (50%) and reach 80% accuracy after training on the full data. If we flip the labels, however, the accuracy on the real test set does not go to 20%, as one would expect with a randomly initialized neural network. Rather, despite finetuning on 4 million flipped samples, the model's pretraining helps it retain 38% accuracy on the true test set.
...and 1 more figures
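The 20% baseline mentioned in the Figure 5 caption follows from simple arithmetic for a binary task: a prediction matches the true label exactly when it misses the flipped label, so the two accuracies sum to 100%. The sketch below works through the numbers from the caption. The helper name `accuracy_on_true_labels` is hypothetical, and the sketch assumes a strictly binary comparison task.

```python
def accuracy_on_true_labels(accuracy_on_flipped_labels_pct):
    """For a binary task, accuracy against the true labels is the complement
    of accuracy against the flipped labels (both in percent)."""
    return 100 - accuracy_on_flipped_labels_pct


# If a model learned the flipped labels as well as it learns the normal
# ones (80%), it would score 20% on the true test set.
expected_true_accuracy = accuracy_on_true_labels(80)

# The caption reports 38% on the true test set after flipped-label training.
observed_true_accuracy = 38

# Equivalently, the model fits the flipped labels only 62% of the time.
implied_flipped_accuracy = 100 - observed_true_accuracy

# Gap in percentage points between the observed and the expected score.
retained_points = observed_true_accuracy - expected_true_accuracy

print(expected_true_accuracy, implied_flipped_accuracy, retained_points)
```

Read this way, the 18-point gap between the expected 20% and the observed 38% is the part of the signal that flipped-label training did not overwrite.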
