Resistance Against Manipulative AI: key factors and possible actions

Piotr Wilczyński, Wiktoria Mieleszczenko-Kowszewicz, Przemysław Biecek

TL;DR

This work explores factors related to the potential of large language models (LLMs) to manipulate human decisions and proposes an ad hoc solution: a Manipulation Fuse, a classifier that detects manipulation by LLMs.

Abstract

If AI is the new electricity, what should we do to keep ourselves from getting electrocuted? In this work, we explore factors related to the potential of large language models (LLMs) to manipulate human decisions. We describe the results of two experiments designed to determine which characteristics of humans are associated with their susceptibility to LLM manipulation, and which characteristics of LLMs are associated with their potential for manipulativeness. We explore human factors through user studies in which participants answer general knowledge questions using LLM-generated hints, and LLM factors by provoking language models to create manipulative statements. We then analyze the models' obedience, the persuasion strategies they use, and their choice of vocabulary. Based on these experiments, we discuss two actions that can protect us from LLM manipulation. In the long term, we put AI literacy at the forefront, arguing that educating society would minimize the risk of manipulation and its consequences. We also propose an ad hoc solution: a Manipulation Fuse, a classifier that detects manipulation by LLMs.

Paper Structure

This paper contains 12 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Analysis of factors correlating with the manipulative potential of LLMs. The strength of the effects was determined on the basis of two RAMAI experiments. Analysis of the results suggests actions that can mitigate the threats of manipulative AI.
  • Figure 2: Screen capture from the RAMAI game used in the user study. Participants were presented with four possible answers to a given question. They could choose an answer immediately or reveal an AI hint, which was not guaranteed to be accurate.
  • Figure 3: Panel (A) shows how often the model generated a manipulative hint suggesting the indicated wrong answer. Panel (B) shows what type of argumentation was used in the model's hints; three groups of strategies, ethos, logos, and pathos, were considered, but ethos did not occur in the analyzed data. The columns correspond to the model considered, and the rows to the prompt construction strategies used.
  • Figure 4: Examples of successful and unsuccessful requests to generate manipulative hints. GPT-3.5-turbo obediently gives false arguments, while Mixtral-8x7B actually suggests the correct answer.
  • Figure 5: Differences in LIWC linguistic features between the texts of manipulative and truthful hints. Values in parentheses are p-values obtained by paired t-tests on min-max normalized data. Statistically significant differences were found in Analytical Thinking, Emotionality, Word Count, Self-references, Certainty, and Lexical Diversity.
  • ...and 2 more figures
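
The Figure 5 caption mentions paired t-tests on min-max normalized data. As a rough illustration of that kind of comparison, the sketch below computes a paired t statistic for one feature using only the Python standard library. The function names, the made-up word counts, and the choice to normalize over the pooled values of both groups are assumptions for illustration and are not taken from the paper.

```python
import math
import statistics


def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical word counts for four question-matched hint pairs (made-up numbers).
manipulative = [42, 55, 38, 61]
truthful = [30, 41, 35, 44]

# Assumption: normalize over the pooled values so both groups share one scale.
pooled = min_max_normalize(manipulative + truthful)
m_norm = pooled[:len(manipulative)]
t_norm = pooled[len(manipulative):]

t_stat, dof = paired_t_statistic(m_norm, t_norm)
print(f"t = {t_stat:.3f}, df = {dof}")
```

The p-value would then come from Student's t distribution with `dof` degrees of freedom; `scipy.stats.ttest_rel` computes both the statistic and the p-value in one call if SciPy is available.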