Moral Alignment for LLM Agents

Elizaveta Tennant, Stephen Hailes, Mirco Musolesi

TL;DR

The paper tackles the alignment problem for LLM-based decision agents by proposing intrinsic moral rewards that explicitly encode human values. Evaluating Deontological and Utilitarian ethics in an Iterated Prisoner's Dilemma (IPD) setting, the authors show that reinforcement learning with intrinsic rewards can induce aligned moral strategies, can make an agent unlearn a previously acquired selfish strategy, and that certain learned strategies generalize to other matrix games. The approach offers a transparent, potentially cost-effective alternative to RLHF/DPO, applicable to domains where payoff structures can be defined. The work also discusses generalization beyond matrix games and outlines avenues for pluralistic and Constitutional AI, highlighting practical implications for deploying morally aware LLM agents.

Abstract

Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are underway to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and their transparency will decrease. Consequently, developing effective methods for aligning them to human values is vital. The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit, opaque and are essentially deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of reward functions that explicitly and transparently encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents. We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences on the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments. In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and it might represent a more transparent and cost-effective alternative to currently predominant alignment techniques.
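
To make the distinction between action-based and consequence-based moral rewards concrete, here is a minimal Python sketch of how such intrinsic rewards might be computed for one IPD round. It is an illustration under stated assumptions, not the authors' implementation: the payoff values, the penalty size, and all function names are hypothetical, and the Deontological norm is read here as "do not defect against an opponent who previously cooperated", while the Utilitarian reward is read as the collective payoff of both players.

```python
# Action labels for the two-player game.
C, D = "C", "D"

# Assumed Prisoner's Dilemma payoffs (my_payoff, opponent_payoff).
# These numbers are illustrative; the abstract does not state them.
PAYOFFS = {
    (C, C): (3, 3),
    (C, D): (0, 4),
    (D, C): (4, 0),
    (D, D): (1, 1),
}


def game_reward(my_action: str, opp_action: str) -> float:
    """Selfish baseline: the agent's own payoff from the game."""
    return float(PAYOFFS[(my_action, opp_action)][0])


def deontological_reward(my_action: str, opp_prev_action: str,
                         penalty: float = 3.0) -> float:
    """Action-based moral reward (assumed norm and penalty size).

    Penalizes defecting against an opponent who cooperated in the
    previous round; every other action is morally neutral.
    """
    if my_action == D and opp_prev_action == C:
        return -penalty
    return 0.0


def utilitarian_reward(my_action: str, opp_action: str) -> float:
    """Consequence-based moral reward: the sum of both players' payoffs."""
    mine, theirs = PAYOFFS[(my_action, opp_action)]
    return float(mine + theirs)
```

Under these assumed values, defecting against a cooperator earns the highest game payoff but a negative Deontological reward and a lower Utilitarian reward than mutual cooperation, which is the tension that moral fine-tuning is meant to resolve.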

Paper Structure

This paper contains 31 sections, 26 figures, and 3 tables.

Figures (26)

  • Figure 1: Payoffs for the Iterated Prisoner's Dilemma.
  • Figure 2: Iterated Prisoner's Dilemma (IPD) prompt used in fine-tuning.
  • Figure 3: Action types played by the LLM agent during different types of fine-tuning on the IPD game a) vs a TFT agent, and b) vs an LLM agent (i.e., two LLMs being fine-tuned at once). For each episode, we plot the actions of the LLM player $M$ given the last move of their opponent $O$.
  • Figure 4: "Unlearning" experiments, where the reward function changes from the IPD game payoffs to a moral intrinsic reward (Deontological or Utilitarian) at episode 500. We visualize action types (action by LLM player $M$ given the last move of their opponent $O$) played by the LLM agent during different types of fine-tuning on the IPD game a) vs a TFT agent, and b) vs an LLM agent.
  • Figure 5: Analysis of generalization of the fine-tuned agents' learned morality to other matrix games, using new action tokens at test time. We visualize a) Deontological and b) Utilitarian regret (normalized across games) for all models, averaging values over 50 test games and five runs ($\pm$ 95% CI).
  • ...and 21 more figures
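
The "action types" plotted in Figures 3 and 4 (the action of player $M$ given the last move of opponent $O$) can be tallied as in the following sketch. This is a hypothetical illustration rather than the authors' code: the function names, the label format `"M|O"`, and the choice to skip the first round (which has no previous opponent move) are assumptions. TFT is taken to be the standard Tit-for-Tat strategy, which cooperates first and then copies the other player's previous move.

```python
from collections import Counter
from typing import Callable, List, Tuple

C, D = "C", "D"


def tit_for_tat(other_history: List[str]) -> str:
    """Standard Tit-for-Tat: cooperate first, then mirror the other player."""
    return other_history[-1] if other_history else C


def play_vs_tft(m_policy: Callable[[List[str]], str],
                rounds: int) -> Tuple[List[str], List[str]]:
    """Play `rounds` simultaneous moves of player M against a TFT opponent O."""
    m_hist: List[str] = []
    o_hist: List[str] = []
    for _ in range(rounds):
        m = m_policy(o_hist)      # M conditions on O's past moves
        o = tit_for_tat(m_hist)   # O conditions on M's past moves
        m_hist.append(m)
        o_hist.append(o)
    return m_hist, o_hist


def action_types(m_actions: List[str], o_actions: List[str]) -> Counter:
    """Count M's actions conditioned on O's previous move.

    Keys look like "D|C": M defected after O cooperated. The first
    round is skipped because O has no previous move (an assumption).
    """
    counts: Counter = Counter()
    for t in range(1, len(m_actions)):
        counts[f"{m_actions[t]}|{o_actions[t - 1]}"] += 1
    return counts
```

For example, `action_types(*play_vs_tft(lambda history: D, 4))` tallies the behavior of an always-defect policy over four rounds against TFT.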