What are human values, and how do we align AI to them?

Oliver Klingefjord, Ryan Lowe, Joe Edelman

TL;DR

This paper tackles the challenge of aligning language models to human values by proposing Moral Graph Elicitation (MGE), a process that elicits context-sensitive values and reconciles them into a structured alignment target called a moral graph. It introduces values cards as granular representations of constitutive attentional policies and uses a wisdom-based edge system to capture how values relate across contexts, enabling fine-grained, auditable model guidance. Through a case study with 500 Americans, the authors demonstrate that MGE yields robust, legitimate, scalable, and generalizable outputs, with strong participant endorsement and minimal ideological manipulation. The framework aims to complement existing regulation and ethics work by shaping model behavior directly via an interpretable, public-facing value graph. The paper also discusses practical pathways for training on the moral graph and acknowledges limitations and future directions for broader cultural applicability and improved auditing.

Abstract

There is an emerging consensus that we need to align AI systems with human values (Gabriel, 2020; Ji et al., 2024), but it remains unclear how to apply this to language models in practice. We split the problem of "aligning to human values" into three parts: first, eliciting values from people; second, reconciling those values into an alignment target for training ML models; and third, actually training the model. In this paper, we focus on the first two parts, and ask the question: what are "good" ways to synthesize diverse human inputs about values into a target for aligning language models? To answer this question, we first define a set of 6 criteria that we believe must be satisfied for an alignment target to shape model behavior in accordance with human values. We then propose a process for eliciting and reconciling values called Moral Graph Elicitation (MGE), which uses a large language model to interview participants about their values in particular contexts; our approach is inspired by the philosophy of values advanced by Taylor (1977), Chang (2004), and others. We trial MGE with a representative sample of 500 Americans, on 3 intentionally divisive prompts (e.g. advice about abortion). Our results demonstrate that MGE is promising for improving model alignment across all 6 criteria. For example, almost all participants (89.1%) felt well represented by the process, and (89%) thought the final moral graph was fair, even if their value wasn't voted as the wisest. Our process often results in "expert" values (e.g. values from women who have solicited abortion advice) rising to the top of the moral graph, without defining who is considered an expert in advance.

Paper Structure

This paper contains 57 sections, 1 equation, 15 figures, and 1 table.

Figures (15)

  • Figure 1: Overview of our Moral Graph Elicitation process. Our process elicits values from a population, and reconciles these values into an alignment target we call a moral graph. We do this by interviewing participants about their values with a chatbot, and then asking them which values they think are wiser than others for a particular context.
  • Figure 2: Anatomy of a Values Card. A values card is a visual representation of a value. (See Definition \ref{def:values}.)
  • Figure 3: "Honesty" will show up as several distinct values, according to Definition \ref{def:values}. This means we can be specific about what "honesty" means to someone, and whether someone else who claims to value honesty means the same.
  • Figure 4: The resulting moral graph from our case study. The nodes in the graph are values cards articulated by participants; the edges represent broad agreement that one value is wiser than another for a particular context. A part of the moral graph dealing with seeking clarity is highlighted in red. Participants agreed that it is wiser to help users articulate their understanding than to give them a set of diverse viewpoints as a bullet list (only titles are shown here).
  • Figure 5: Our process for eliciting a moral graph from articulated values cards. We create edges by asking participants whether they think fictional people moving from one value to another in a generated story became wiser (according to Definition \ref{def:wisdom}), for a particular context. The values cards and generated story shown here can be found in Figure \ref{fig:generation}.
  • ...and 10 more figures
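
The captions for Figures 4 and 5 describe the moral graph as values cards (nodes) connected by context-specific "wiser than" edges that reflect broad participant agreement. A minimal sketch of such a structure, not the authors' implementation, might look as follows. The class name `MoralGraph`, its methods, the 0.75 agreement threshold, and the rule used to pick the "wisest" values are all assumptions made for illustration; the example labels are paraphrased from the Figure 4 caption.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MoralGraph:
    """Toy moral graph: nodes are values-card titles, edges mean 'wiser than'."""

    # votes[(context, less_wise, wiser)] -> [agree_count, total_count]
    votes: dict = field(default_factory=lambda: defaultdict(lambda: [0, 0]))

    def add_vote(self, context, less_wise, wiser, agrees):
        """Record one participant's judgment on a proposed edge."""
        tally = self.votes[(context, less_wise, wiser)]
        tally[0] += int(agrees)
        tally[1] += 1

    def edges(self, context, min_agreement=0.75):
        """Edges whose agreement ratio meets the (assumed) threshold."""
        return [
            (src, dst)
            for (ctx, src, dst), (yes, total) in self.votes.items()
            if ctx == context and total > 0 and yes / total >= min_agreement
        ]

    def wisest(self, context, min_agreement=0.75):
        """Values that receive a kept edge and have no outgoing kept edge."""
        kept = self.edges(context, min_agreement)
        sources = {src for src, _ in kept}
        targets = {dst for _, dst in kept}
        return sorted(targets - sources)


# Hypothetical votes for the "seeking clarity" context.
graph = MoralGraph()
ctx = "seeking clarity"
for agrees in (True, True, True, False):
    graph.add_vote(ctx, "Diverse viewpoints", "Help articulate understanding", agrees)
graph.add_vote(ctx, "Help articulate understanding", "Diverse viewpoints", False)

print(graph.edges(ctx))
print(graph.wisest(ctx))
```

Keying each edge by context keeps a value that is judged wiser in one situation from being treated as wiser everywhere, which matches the context-sensitive framing in the abstract. The threshold-based edge filter is only one possible reading of "broad agreement".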

Theorems & Definitions (9)

  • Definition 2.1: Values; Charles Taylor
  • Definition 2.2: Alignment target
  • Definition 2.3: Wisdom; in the context of values
  • Definition 4.1: Attentional policies (APs)
  • Definition 4.2: Constitutive Attentional Policies (CAPs)
  • Definition 4.3: Value
  • Definition 4.4: Ideological statement
  • Definition 4.5: Meaningful Choice
  • Definition 4.6