Moral Foundations of Large Language Models

Marwa Abdulhai, Gregory Serapio-Garcia, Clément Crepy, Daria Valter, John Canny, Natasha Jaques

TL;DR

The paper uses Moral Foundations Theory to investigate whether popular large language models encode moral values shaped by their training data, and whether these values are stable across contexts or controllable via prompting. By applying the Moral Foundations Questionnaire to GPT-3 and PaLM, the authors compare model-derived moral profiles to those of human populations and assess consistency across varied prompts and tasks. They show that LLMs often align with more conservative human profiles, though this alignment can be shifted by explicit political prompts or by prompts designed to maximize specific moral foundations, with measurable effects on a downstream donation task. The work highlights ethical risks and potential misuse, suggests avenues for mitigating bias, and emphasizes the need for careful scrutiny as LLMs become increasingly embedded in real-world applications.

Abstract

Moral foundations theory (MFT) is a psychological assessment tool that decomposes human moral reasoning into five factors, including care/harm, liberty/oppression, and sanctity/degradation (Graham et al., 2009). People vary in the weight they place on these dimensions when making moral decisions, in part due to their cultural upbringing and political ideology. As large language models (LLMs) are trained on datasets collected from the internet, they may reflect the biases that are present in such corpora. This paper uses MFT as a lens to analyze whether popular LLMs have acquired a bias towards a particular set of moral values. We analyze known LLMs and find they exhibit particular moral foundations, and show how these relate to human moral foundations and political affiliations. We also measure the consistency of these biases, or whether they vary strongly depending on the context of how the model is prompted. Finally, we show that we can adversarially select prompts that encourage the model to exhibit a particular set of moral foundations, and that this can affect the model's behavior on downstream tasks. These findings help illustrate the potential risks and unintended consequences of LLMs assuming a particular moral stance.
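
The consistency measurement and the adversarial prompt selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `consistency` and `best_prompt`, the prompt labels, the 0-5 score scale, and every number are assumptions made up for illustration. The sketch assumes that per-foundation scores have already been obtained for each candidate prompt, then summarizes how much one foundation varies across prompts and picks the prompt that maximizes a chosen foundation.

```python
from statistics import mean, pvariance

# Hypothetical data: foundation scores (0-5 scale assumed) obtained after
# conditioning the model on each candidate prompt. All numbers are made up.
scores_by_prompt = {
    "prompt_a": {"care": 3.0, "sanctity": 1.0},
    "prompt_b": {"care": 4.0, "sanctity": 2.0},
    "prompt_c": {"care": 2.0, "sanctity": 4.5},
}


def consistency(scores, foundation):
    """Return (mean, population variance) of one foundation across prompts.

    A small variance suggests the score is stable across contexts.
    """
    values = [per_prompt[foundation] for per_prompt in scores.values()]
    return mean(values), pvariance(values)


def best_prompt(scores, foundation):
    """Return the prompt label with the highest score on the given foundation."""
    return max(scores, key=lambda label: scores[label][foundation])


care_mean, care_var = consistency(scores_by_prompt, "care")
print(f"care: mean={care_mean:.2f}, variance={care_var:.2f}")
print("prompt maximizing sanctity:", best_prompt(scores_by_prompt, "sanctity"))
```

Selecting by a simple maximum over a fixed candidate pool is one plausible reading of "adversarially select prompts"; the paper may use a different candidate pool or selection criterion.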

Paper Structure (23 sections, 12 figures, 3 tables)

Figures (12)

  • Figure 1: We apply t-SNE to reduce moral foundations scores to two dimensions and plot the location of different human populations alongside the LLMs. Each LLM is prompted with either no prompt (the default model) or a political prompt. Human data is shown in blue and comes from psychology studies of human participants in different demographics (anonymous online participants, US participants, and Korean participants), who self-reported their political affiliation (haidt_all; moral_foundations_korea).
  • Figure 2: MFQ scores from human studies, grouped by self-reported political affiliation (haidt_all) (a), vs. GPT-3 DaVinci2 (b).
  • Figure 3: We assess consistency in moral foundations by prompting the LLM with 50 randomly selected book dialogues from the BookCorpus dataset (Zhu_2015_ICCV), and observing the resulting distribution of moral foundations scores.
  • Figure 4: PaLM moral foundation scores.
  • Figure 5: For each moral foundation, we select prompts that maximize the score for that foundation.
  • ...and 7 more figures