Hybrid Approaches for Moral Value Alignment in AI Agents: a Manifesto

Elizaveta Tennant, Stephen Hailes, Mirco Musolesi

TL;DR

The paper addresses the moral value alignment problem for AI by advocating a hybrid design that integrates explicit moral principles with experiential learning. It frames a continuum from fully top-down to fully bottom-up approaches and analyzes four case studies (morally-constrained RL, Constitutional AI, RL in social dilemmas, and LLM fine-tuning with intrinsic rewards) to illustrate practical hybrid implementations. It discusses intrinsic-reward formulations based on utilitarian, deontological, and virtue ethics, and outlines evaluation strategies to detect reward hacking and assess social outcomes. The work emphasizes safety, interpretability, and adaptability, argues that hybrid methods can provide robust, scalable moral alignment in agentic systems, and proposes open questions for future research across AI safety, ethics, and computational social science.

Abstract

Increasing interest in ensuring the safety of next-generation Artificial Intelligence (AI) systems calls for novel approaches to embedding morality into autonomous agents. This goal differs qualitatively from traditional task-specific AI methodologies. In this paper, we provide a systematization of existing approaches to the problem of introducing morality in machines - modelled as a continuum. Our analysis suggests that popular techniques lie at the extremes of this continuum - either being fully hard-coded into top-down, explicit rules, or entirely learned in a bottom-up, implicit fashion with no direct statement of any moral principle (this includes learning from human feedback, as applied to the training and finetuning of large language models, or LLMs). Given the relative strengths and weaknesses of each type of methodology, we argue that more hybrid solutions are needed to create adaptable and robust, yet controllable and interpretable agentic systems. To that end, this paper discusses both the ethical foundations (including deontology, consequentialism and virtue ethics) and implementations of morally aligned AI systems. We present a series of case studies that rely on intrinsic rewards, moral constraints or textual instructions, applied to either pure-Reinforcement Learning or LLM-based agents. By analysing these diverse implementations under one framework, we compare their relative strengths and shortcomings in developing morally aligned AI systems. We then discuss strategies for evaluating the effectiveness of moral learning agents. Finally, we present open research questions and implications for the future of AI safety and ethics which are emerging from this hybrid framework.

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: The Moral Alignment Problem involves one or more artificial agents that make decisions in some task-specific environment. The agent is influenced by individual humans (with their own task-specific goals and values) and by human institutions or societies (which may also have task-specific goals as well as norms and laws). Ideally, we want the agent's decision-making to be aligned with principles valued by the humans or institutions, but representing and embedding those moral principles in computational terms is a challenge that poses significant risks of misalignment.
  • Figure 2: The continuum of existing approaches to building morality into agents. We highlight the hybrid approach of blending learning with top-down pre-defined moral principles as an emerging and promising research direction.
  • Figure 3: RL architectures for a hypothetical environment with two possible states and two actions. In tabular Q-learning (1.1) (Watkins and Dayan, 1992), an agent learns a direct mapping from every state to the value $Q$ of taking each possible action in that state. In DQN (1.2), an agent trains a neural network, which takes a representation of the state as input and outputs the associated Q-value for each possible action. In policy-based methods, such as PPO (Schulman et al., 2017), an agent learns a policy model, which maps input states to a probability $P$ of taking each action in that state (in practice, policy-based RL often also uses a separate value model/network for estimating the value $V$ of a given state, not pictured here). The use of neural networks allows for function approximation, which means that not all states need to be observed during training for an agent to learn a (theoretically) general representation.
  • Figure 4: A step in a two-player game modelled as an RL process, from the point of view of a learning moral agent $M$ playing against a learning opponent $O$. Agent $M$ calculates their moral (intrinsic) reward according to their own ethical framework, which may or may not consider the game (extrinsic) reward coming from playing the dilemma.
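
The setup described in the captions of Figures 3 and 4 can be sketched in a few lines: a tabular Q-learning agent $M$ plays an iterated Prisoner's Dilemma and learns from an intrinsic moral reward rather than from the game payoff directly. This is a minimal sketch under stated assumptions, not the authors' implementation. The payoff values, the two reward functions (`utilitarian_reward`, `deontological_reward`), the penalty of -5, the hyperparameters, and the choice of a fixed tit-for-tat opponent (instead of the learning opponent $O$ in Figure 4) are all illustrative choices introduced here.

```python
import random

C, D = 0, 1  # actions: cooperate, defect

# Assumed Prisoner's Dilemma payoffs as (payoff to M, payoff to O); illustrative values.
PAYOFFS = {(C, C): (3, 3), (C, D): (0, 4), (D, C): (4, 0), (D, D): (1, 1)}

# State = (M's last action, O's last action), so the table has four rows.
STATES = [(m, o) for m in (C, D) for o in (C, D)]


def utilitarian_reward(m_payoff, o_payoff, m_action, o_last_action):
    """Hypothetical consequentialist reward: the collective game payoff."""
    return float(m_payoff + o_payoff)


def deontological_reward(m_payoff, o_payoff, m_action, o_last_action):
    """Hypothetical norm-based reward: penalise defecting against a cooperator.

    It ignores the game (extrinsic) payoff entirely.
    """
    return -5.0 if (m_action == D and o_last_action == C) else 0.0


def greedy_action(q, state):
    """Action with the highest Q-value in `state` (ties resolve to cooperate)."""
    return max((C, D), key=lambda a: q[state][a])


def train(moral_reward, steps=10_000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning for agent M, driven only by the intrinsic moral reward."""
    rng = random.Random(seed)
    q = {s: [0.0, 0.0] for s in STATES}  # Q[state][action]
    state = (C, C)  # assume both players "cooperated" before the first round
    for _ in range(steps):
        # Epsilon-greedy action selection.
        if rng.random() < epsilon:
            m_action = rng.choice((C, D))
        else:
            m_action = greedy_action(q, state)
        o_action = state[0]  # assumed tit-for-tat opponent: copies M's last action
        m_payoff, o_payoff = PAYOFFS[(m_action, o_action)]
        reward = moral_reward(m_payoff, o_payoff, m_action, state[1])
        next_state = (m_action, o_action)
        # Standard Q-learning update towards the bootstrapped target.
        td_target = reward + gamma * max(q[next_state])
        q[state][m_action] += alpha * (td_target - q[state][m_action])
        state = next_state
    return q


if __name__ == "__main__":
    for fn in (utilitarian_reward, deontological_reward):
        q = train(fn)
        print(fn.__name__, {s: greedy_action(q, s) for s in STATES})
```

Swapping `moral_reward` is the only change needed to move between ethical framings, which is the sense in which the moral principle is stated top-down while the behaviour is still learned bottom-up. Replacing the table `q` with a neural network would give the DQN variant from Figure 3.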