Value alignment: a formal approach

Carles Sierra, Nardine Osman, Pablo Noriega, Jordi Sabater-Mir, Antoni Perelló

TL;DR

The paper addresses value alignment for autonomous systems by modeling values as state-based preferences within a labelled transition system $(\mathcal{S},\mathcal{A},T)$ and formalizing how norms alter transitions to yield aligned outcomes. It introduces a formal, quantitative notion of value-based preferences $\mathsf{Prf}^{\alpha}_{v}(s,s')$ and several aggregation schemes for deriving group-level and value-level preferences, so that the alignment of a norm $n$ with a value $v$ can be measured via $\mathsf{Algn}_{n,v}^{\alpha}$. An illustrative Prisoner’s Dilemma example shows how different tax norms $n_0,n_1,n_2$ interact with valuations of equality to produce varying alignment patterns, using Monte Carlo sampling to estimate path-based effects. The work lays the groundwork for optimizing norm sets and social aggregations to maximize alignment. It identifies as future directions the specification of preference aggregation functions, probability models for state properties, and more nuanced weighting of paths and transitions in larger socio-cognitive systems.
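The path-based estimate mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy world (states are tuples of agent wealths), the `equality` score, the `preference` function standing in for $\mathsf{Prf}^{\alpha}_{v}(s,s')$, the income distribution, and the flat-tax norm controlled by `tax_rate`. The alignment estimate is simply the mean preference over all transitions on randomly sampled paths.

```python
import random


def equality(state):
    """Toy equality score in [0, 1] (assumed): 1 minus the normalised wealth gap."""
    total = sum(state)
    if total == 0:
        return 1.0
    return 1.0 - (max(state) - min(state)) / total


def preference(s, s_next):
    """Assumed stand-in for Prf_v(s, s'): change in the equality score, in [-1, 1]."""
    return equality(s_next) - equality(s)


def step(state, rng, tax_rate=0.0):
    """One transition of the toy world.

    Each agent earns a random income. A hypothetical tax norm collects
    `tax_rate` of every income and redistributes the pot equally.
    """
    incomes = [rng.choice([0, 1, 3, 5]) for _ in state]
    pot = sum(tax_rate * inc for inc in incomes)
    share = pot / len(state)
    return tuple(w + (1 - tax_rate) * inc + share for w, inc in zip(state, incomes))


def estimate_alignment(initial, tax_rate, n_paths=500, length=10, seed=0):
    """Monte Carlo estimate: mean preference over all transitions of sampled paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        s = initial
        for _ in range(length):
            s_next = step(s, rng, tax_rate)
            total += preference(s, s_next)
            s = s_next
    return total / (n_paths * length)


if __name__ == "__main__":
    for rate in (0.0, 0.5, 1.0):
        print(rate, estimate_alignment((10, 0), rate))
```

Comparing the printed estimates for different values of `tax_rate` mirrors, in spirit only, the comparison of norms with respect to an equality value; the paper's own world model and norms are not reproduced here.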

Abstract

Value alignment is one of the principles that should govern autonomous AI systems. It essentially states that a system's goals and behaviour should be aligned with human values. But how can value alignment be ensured? In this paper we first provide a formal model to represent values through preferences, together with ways to compute value aggregations, i.e. preferences with respect to a group of agents and/or preferences with respect to sets of values. Value alignment is then defined, and computed, for a given norm with respect to a given value through the increase or decrease in preference that the norm brings about over future states of the world. We focus on norms because it is norms that govern behaviour; as such, the alignment of a given system with a given value will be dictated by the norms the system follows.
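The two kinds of aggregation mentioned in the abstract, over a group of agents and over a set of values, can be sketched as follows. The concrete aggregation functions are not given in this summary, so the choices below (arithmetic mean, minimum, and a weighted mean) are illustrative assumptions, as are the function names and the convention that preferences lie in $[-1, 1]$.

```python
from statistics import mean


def aggregate_over_agents(prefs_by_agent, how="mean"):
    """Combine per-agent preferences for one transition into a group preference.

    prefs_by_agent: dict mapping agent name -> preference in [-1, 1] (assumed range).
    how: "mean" (utilitarian-style) or "min" (worst-off agent); both are assumptions.
    """
    values = list(prefs_by_agent.values())
    if how == "mean":
        return mean(values)
    if how == "min":
        return min(values)
    raise ValueError(f"unknown aggregation: {how}")


def aggregate_over_values(prefs_by_value, weights):
    """Combine preferences for several values into one, using a weighted mean.

    prefs_by_value: dict mapping value name -> preference in [-1, 1].
    weights: dict mapping value name -> non-negative weight (hypothetical).
    """
    total_weight = sum(weights[v] for v in prefs_by_value)
    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    return sum(weights[v] * p for v, p in prefs_by_value.items()) / total_weight


if __name__ == "__main__":
    group = aggregate_over_agents({"alice": 0.5, "bob": -0.5})
    worst = aggregate_over_agents({"alice": 0.5, "bob": -0.5}, how="min")
    combined = aggregate_over_values(
        {"equality": 1.0, "gain": 0.0}, {"equality": 1.0, "gain": 1.0}
    )
    print(group, worst, combined)
```

The choice of aggregation matters: a mean can hide a strongly negative preference held by one agent, while a minimum makes the group preference depend entirely on the least satisfied agent.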

Paper Structure

This paper contains 16 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: General view of an agent's value-guided behaviour
  • Figure 2: The different value-based preferences and the different aggregation functions
  • Figure 3: Applying a norm to a given world alters the transitions and their resulting states
  • Figure 4: Applying norms alters the world

Theorems & Definitions (6)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6