Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment

Allison Huang, Yulu Niki Pi, Carlos Mougan

TL;DR

The results demonstrate that LLMs can indeed be persuaded in morally charged scenarios, with the success of persuasion depending on factors such as the model used, the complexity of the scenario, and the conversation length.

Abstract

We explore how large language models (LLMs) can be influenced by prompting them to alter their initial decisions and align them with established ethical frameworks. Our study is based on two experiments designed to assess the susceptibility of LLMs to moral persuasion. In the first experiment, we examine the susceptibility to moral ambiguity by evaluating a Base Agent LLM on morally ambiguous scenarios and observing how a Persuader Agent attempts to modify the Base Agent's initial decisions. The second experiment evaluates the susceptibility of LLMs to align with predefined ethical frameworks by prompting them to adopt specific value alignments rooted in established philosophical theories. The results demonstrate that LLMs can indeed be persuaded in morally charged scenarios, with the success of persuasion depending on factors such as the model used, the complexity of the scenario, and the conversation length. Notably, LLMs of distinct sizes but from the same company produced markedly different outcomes, highlighting the variability in their susceptibility to ethical persuasion.

Paper Structure

This paper contains 22 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Change in Action Likelihood (left) and Decision Change Rate (right) over number of turns for four permutations of models. Conversations with more turns tend to result in higher CAL and DCR, with the exception of mistral-7b-instruct.
  • Figure 2: Change in Action Likelihood for each pairwise combination of LLMs as either the Base Agent or Persuader Agent (left). We find that a model's susceptibility to persuasion is far more variable than a model's ability to persuade. Change in Rule Violation Rate by Base Agent model and rule (right). The rules are ordered by mean absolute average from highest (top) to lowest (bottom); i.e., on average, models changed the rate at which they violated the rule "Do not break promises."
  • Figure 3: MFQ scores across various ethical prompts. The radar plots illustrate how different ethical alignment prompts influence the models' responses across the five moral foundations. The gpt-4o model shows significant variation, especially under the utilitarian prompt, indicating a strong alignment shift. In contrast, claude-3-haiku exhibits more consistent scores across all prompts, suggesting less sensitivity to ethical alignment. mistral-7b-instruct shows the highest variation, with utilitarian ethics resulting in the lowest MFQ scores and virtue ethics in the highest.
  • Figure 4: Probability density of selecting action1 across various models on generated scenarios (left) and on handwritten scenarios (right). The distribution peaks indicate the strongest preference towards action1, with distinct variations in likelihood across different models.

Theorems & Definitions (3)

  • Definition 3.1: Change in Action Likelihood
  • Definition 3.2: Decision Change Rate
  • Definition 3.3: Rule Violation Rate