
Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language

Amalie Brogaard Pauli, Isabelle Augenstein, Ira Assent

TL;DR

This work introduces Persuasive-Pairs, a cross-domain dataset of short-text pairs where an LLM rewrites original text to be more or less persuasive, plus a regression model to score relative persuasiveness. By collecting three annotations per pair and training a regression model on a DeBERTaV3-Large backbone, the authors provide a robust, generalizable metric ($PS$) for comparing LLMs and prompting strategies across domains. Key findings show that system-prompt personas substantially affect persuasiveness, even when the model is only instructed to paraphrase, and that default paraphrasing tends to reduce persuasiveness on average. The resulting benchmarking framework enables safe, scalable evaluation of persuasive language generation, with practical implications for model selection, prompt design, and mitigation of unwanted persuasive content across diverse domains.

Abstract

We are exposed to much information trying to influence us, such as teaser messages, debates, politically framed news, and propaganda - all of which use persuasive language. With the recent interest in Large Language Models (LLMs), we study the ability of LLMs to produce persuasive text. As opposed to prior work which focuses on particular domains or types of persuasion, we conduct a general study across various domains to measure and benchmark to what degree LLMs produce persuasive language - both when explicitly instructed to rewrite text to be more or less persuasive and when only instructed to paraphrase. We construct the new dataset Persuasive-Pairs of pairs of a short text and its rewrite by an LLM to amplify or diminish persuasive language. We multi-annotate the pairs on a relative scale for persuasive language: a valuable resource in itself, and for training a regression model to score and benchmark persuasive language, including for new LLMs across domains. In our analysis, we find that different 'personas' in LLaMA3's system prompt change persuasive language substantially, even when only instructed to paraphrase.


Paper Structure

This paper contains 44 sections, 1 equation, 19 figures, 3 tables.

Figures (19)

  • Figure 1: Annotations by three workers. Text A is from PersuasionForGood (wang2019persuasion). LLaMA3, instructed to be more persuasive, produces Text B.
  • Figure 2: The procedure for constructing the Persuasive-Pairs dataset; subsequent training of a regression model on the data; applying the model to benchmark new LLMs/settings on new source text
  • Figure 3: Sources, genre, type in Persuasive-Pairs with 2697 pairs
  • Figure 4: IAA: Krippendorff's alpha on the ordinal 6-point score across the three annotation sets
  • Figure 5: Cohen's Kappa between the majority annotation's binary choice of the more persuasive text and the rewrite instruction
  • ...and 14 more figures
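The figures above report inter-annotator agreement with Krippendorff's alpha and Cohen's Kappa. As a reference for how the Kappa comparison in Figure 5 works, here is a minimal, self-contained sketch of Cohen's Kappa between two label sequences (e.g., the rewrite instruction vs. the majority-annotated choice of more persuasive text); the example labels are toy data, not the paper's annotations:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where both raters agree.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each rater's label marginals.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

# Toy example: which text ("A" or "B") the instruction targeted
# vs. which one the majority of annotators found more persuasive.
instruction = ["B", "B", "A", "B", "A", "B"]
majority    = ["B", "A", "A", "B", "A", "B"]
print(round(cohens_kappa(instruction, majority), 3))  # -> 0.667
```

Kappa of 1 means perfect agreement, 0 means agreement no better than chance; comparing the instruction against the majority annotation, as in Figure 5, checks how often the LLM's rewrite actually lands in the instructed direction.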