Value alignment: a formal approach
Carles Sierra, Nardine Osman, Pablo Noriega, Jordi Sabater-Mir, Antoni Perelló
TL;DR
The paper addresses value alignment for autonomous systems by modeling values as state-based preferences within a labelled transition system $(\mathcal{S},\mathcal{A},T)$ and formalizing how norms alter transitions to yield aligned outcomes. It introduces a formal, quantitative notion of value-based preferences $\mathsf{Prf}^{\alpha}_{v}(s,s')$ and several aggregation schemes to derive group and value-level preferences, enabling norm-alignment to be measured via $\mathsf{Algn}_{n,v}^{\alpha}$. An illustrative Prisoner’s Dilemma example demonstrates how different tax norms $n_0,n_1,n_2$ interact with equality-valuations to produce varying alignment patterns, using Monte Carlo sampling to estimate path-based effects. The work lays groundwork for optimizing norm sets and social aggregations to maximize alignment, and identifies future directions to specify preference aggregation functions, probability models for state properties, and more nuanced weighting of paths and transitions in larger socio-cognitive systems.
Abstract
Value alignment is one of the principles that should govern autonomous AI systems. It essentially states that a system's goals and behaviour should be aligned with human values. But how can value alignment be ensured? In this paper we first provide a formal model to represent values through preferences and ways to compute value aggregations, i.e. preferences with respect to a group of agents and/or preferences with respect to sets of values. Value alignment is then defined, and computed, for a given norm with respect to a given value through the increase or decrease that the norm produces in the preferences over future states of the world. We focus on norms because it is norms that govern behaviour, and as such, the alignment of a given system with a given value will be dictated by the norms the system follows.
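The idea of measuring a norm's alignment with a value through the preference change along future states can be illustrated with a small Monte Carlo sketch. This is not the paper's Prisoner's Dilemma model: the two-agent wealth state, the random-income transition, the flat redistributive tax, and the names `equality`, `preference`, `step`, and `estimate_alignment` are all hypothetical choices made for illustration. The sketch assumes preferences lie in $[-1, 1]$ and estimates alignment as the mean preference over all transitions in the sampled paths.

```python
import random


def equality(state):
    """Hypothetical equality score in [-1, 1]: 1 when wealth is equal."""
    a, b = state
    total = a + b
    if total == 0:
        return 1.0
    return 1.0 - 2.0 * abs(a - b) / total


def preference(s, s_next):
    """Assumed value-based preference in [-1, 1] for moving from s to s_next."""
    return (equality(s_next) - equality(s)) / 2.0


def step(state, tax_rate, rng):
    """Toy transition: random income, then a flat tax shared out evenly.

    The tax rule plays the role of a norm; it is an assumption of this
    sketch, not the norm definition used in the paper.
    """
    gross = [wealth + rng.randint(0, 10) for wealth in state]
    collected = [tax_rate * g for g in gross]
    share = sum(collected) / len(gross)
    return tuple(g - c + share for g, c in zip(gross, collected))


def estimate_alignment(tax_rate, initial=(10.0, 0.0), n_paths=200,
                       length=10, seed=0):
    """Monte Carlo estimate: mean preference over sampled transitions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        state = initial
        for _ in range(length):
            next_state = step(state, tax_rate, rng)
            total += preference(state, next_state)
            state = next_state
    return total / (n_paths * length)


if __name__ == "__main__":
    for rate in (0.0, 0.5, 1.0):
        print(f"tax rate {rate}: alignment {estimate_alignment(rate):.3f}")
```

Under these assumptions, a higher tax rate pushes wealth toward equality sooner, so its estimated alignment with the equality value is at least as high as that of the untaxed system. Comparing estimates across candidate norms in this way mirrors, in miniature, the comparison of tax norms described above.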
