Evaluating Language Models for Harmful Manipulation

Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger

Abstract

Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants across interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, can induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results obtained in one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.

Paper Structure

This paper contains 51 sections, 6 figures, and 13 tables.

Figures (6)

  • Figure 1: Visualisation of study design. Participants are recruited, enter the intervention phase, and complete post-intervention measures.
  • Figure 2: Odds ratios with 95% confidence intervals for each experimental metric -- representing the odds of a participant experiencing a specific outcome in the experimental conditions relative to the flip card baseline -- are presented by domain and policy. The vertical reference line at 1.0 represents the point of no effect, where an outcome is equally likely in the experimental and flip card conditions. (A minimal sketch of this computation follows the figure list.)
  • Figure 3: Distribution of manipulative cues across elicitation conditions and locales. The primary bars indicate the proportion of model responses in which manipulative cues were present (colour-coded) versus absent (black). Within the subset of responses containing cues, the coloured bars indicate the proportion of each cue type among all observed cues. Note: Percentages for specific cues are calculated relative to the total number of observed cues rather than the total number of model responses. Because a single model response may contain multiple concurrent cues, the total cue count can exceed the number of responses in which cues were present.
  • Figure 4: Heatmap of Pearson's $r$ correlations between cue occurrence within a dialogue and participant outcomes. Data are restricted to cues with $n > 100$ observations. Shading intensity corresponds to correlation strength, with significance thresholds set at 0.05 (*), 0.01 (**), and 0.001 (***). (See the correlation sketch after this list.)
  • Figure A.7: Frequency of participant outcomes by domain and geography, aggregated across all conditions, with 95% CIs.
  • ...and 1 more figure
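
For readers who want to sanity-check the quantity reported in Figure 2, the snippet below is a minimal sketch of how an odds ratio and its 95% confidence interval can be computed from a 2x2 outcome table using the standard log-odds (Woolf) approximation. The counts are hypothetical, chosen only for illustration; the paper's own estimates, reported by domain and policy, are presumably derived from fitted models rather than raw 2x2 tables.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and 95% CI from a 2x2 table (Woolf's log-OR method):
    a = outcome present, experimental; b = outcome absent, experimental;
    c = outcome present, baseline;     d = outcome absent, baseline.
    """
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(odds_ratio) - z * se)
    hi = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, (lo, hi)

# Hypothetical counts (not data from the study): 180 of 500 experimental
# participants show the outcome vs. 120 of 500 in the flip card baseline.
or_est, (lo, hi) = odds_ratio_ci(180, 320, 120, 380)
print(f"OR = {or_est:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# An interval excluding 1.0 indicates a difference from the flip card baseline,
# matching the interpretation of the vertical reference line in Figure 2.
```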
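Similarly, the cue-outcome associations in Figure 4 are ordinary Pearson's $r$ correlations annotated with significance stars. The sketch below assumes binary per-dialogue indicators for cue occurrence and participant outcome (an assumption about the coding, not confirmed by the caption) and uses scipy.stats.pearsonr; with binary variables, Pearson's $r$ reduces to the phi coefficient. The data are hypothetical.

```python
from scipy.stats import pearsonr

def stars(p):
    """Map a p-value to the significance markers used in Figure 4."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

# Hypothetical per-dialogue indicators: 1 if the cue occurred in the
# dialogue, and 1 if the participant showed the outcome of interest.
cue     = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
outcome = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0]

r, p = pearsonr(cue, outcome)
print(f"r = {r:.2f}{stars(p)} (p = {p:.3f})")
```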