Learning the Value Systems of Societies with Preference-based Multi-objective Reinforcement Learning

Andrés Holgado-Sánchez, Peter Vamplew, Richard Dazeley, Sascha Ossowski, Holger Billhardt

TL;DR

The paper tackles learning the value systems of a society within Markov Decision Processes by grounding values in a multi-objective reward framework and clustering agents into value-system groups. It introduces SVSL-P, an online, human-in-the-loop (HiL) preference-based multi-objective reinforcement learning (PbMORL) algorithm that jointly learns a social grounding $\pmb{R}^{\theta}$, per-cluster weight vectors $\pmb{W}^{\omega}$, and policies $\Pi(s,a|W)$ by sampling trajectories and collecting value-alignment and value-system preferences. A bi-level optimization balances representativeness and conciseness while maximizing grounding coherence, yielding a compact set of diverse, Pareto-efficient policies. Empirical evaluation in the Firefighters (FF) and Multivalued Car environments shows that SVSL-P can closely approximate ground-truth Pareto fronts with higher representativeness and coherence than baselines, while requiring less human feedback than non-HiL approaches. This work advances society-aware AI by enabling scalable, interpretable modeling of multiple value preferences and their associated behaviours in sequential decision-making tasks.

Abstract

Value-aware AI should recognise human values and adapt to the value systems (value-based preferences) of different users. This requires operationalization of values, which can be prone to misspecification. The social nature of values demands their representation to adhere to multiple users while value systems are diverse, yet exhibit patterns among groups. In sequential decision making, efforts have been made towards personalization for different goals or values from demonstrations of diverse agents. However, these approaches demand manually designed features or lack value-based interpretability and/or adaptability to diverse user preferences. We propose algorithms for learning models of value alignment and value systems for a society of agents in Markov Decision Processes (MDPs), based on clustering and preference-based multi-objective reinforcement learning (PbMORL). We jointly learn socially-derived value alignment models (groundings) and a set of value systems that concisely represent different groups of users (clusters) in a society. Each cluster consists of a value system representing the value-based preferences of its members and an approximately Pareto-optimal policy that reflects behaviours aligned with this value system. We evaluate our method against a state-of-the-art PbMORL algorithm and baselines on two MDPs with human values.
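
To make the notion of clusters and value systems more concrete, the following is a minimal sketch, not the SVSL-P algorithm itself. It assumes that a value system can be modelled as a weight vector over per-value alignment returns (linear scalarization) and that an agent belongs to the cluster whose weights reproduce the largest share of its pairwise trajectory preferences. The function names, the toy data, and the agreement-based assignment rule are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def scalarize(returns, weights):
    """Weighted sum of per-value alignment returns (linear scalarization)."""
    return np.asarray(returns, dtype=float) @ np.asarray(weights, dtype=float)


def agreement(weights, pairs, prefs):
    """Fraction of an agent's pairwise preferences that `weights` reproduces.

    pairs: shape (n, 2, k), per-value returns of two trajectories per comparison
    prefs: shape (n,), 1 if the agent prefers the first trajectory, else 0
    """
    pairs = np.asarray(pairs, dtype=float)
    first = scalarize(pairs[:, 0, :], weights)
    second = scalarize(pairs[:, 1, :], weights)
    predicted = (first > second).astype(int)
    return float(np.mean(predicted == np.asarray(prefs)))


def assign_clusters(cluster_weights, agents):
    """Assign each agent to the cluster whose weights best match its preferences."""
    return [
        int(np.argmax([agreement(w, pairs, prefs) for w in cluster_weights]))
        for pairs, prefs in agents
    ]


# Hypothetical toy data: two values, two candidate value systems (assumed).
cluster_weights = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]
pairs = [[[1.0, 0.0], [0.0, 1.0]],
         [[2.0, 1.0], [1.0, 3.0]]]
agents = [
    (pairs, [1, 1]),  # agent 0 always prefers the first trajectory
    (pairs, [0, 0]),  # agent 1 always prefers the second trajectory
]
assignment = assign_clusters(cluster_weights, agents)
print(assignment)  # [0, 1]
```

In this toy setting the agreement score plays the role of a per-agent representativeness measure: a small set of weight vectors is a good summary of the society when every agent has some cluster with high agreement.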

Paper Structure (17 sections, 15 equations, 3 figures, 8 tables, 3 algorithms)

This paper contains 17 sections, 15 equations, 3 figures, 8 tables, 3 algorithms.

Figures (3)

  • Figure 1: FF environment. Approximated Pareto front and clusters learned with PbMORL (top) and SVSL-P (bottom, ours) for one particular seed. Black squares form the ground-truth Pareto front. White dots depict weights whose policies are in the approximated front. Coloured dots indicate the policies representing each learned cluster (see the legend).
  • Figure 2: FF environment. Pareto front and clusters learned with PbMORL across the 10 different seeds. Black squares indicate the known Pareto front of the environment in terms of the alignment with the two values. White dots depict weights whose policies are in the learned front. Coloured dots indicate the value system weights identifying each learned cluster (see the legend). Note that not all of the latter are necessarily Pareto-efficient.
  • Figure 3: FF environment. Pareto front and clusters learned with SVSL-P across the 10 different seeds. Black squares indicate the known Pareto front of the environment in terms of the alignment with the two values. White dots depict weights whose policies are in the learned front. Coloured dots indicate the value system weights identifying each learned cluster (see the legend). Note that not all of the latter are necessarily Pareto-efficient.

Theorems & Definitions (8)

  • Definition 1: Value Alignment
  • Definition 2: Grounding
  • Definition 3: Value System
  • Definition 4: Value System Function
  • Definition 5: Value System of a Society
  • Definition 6: Coherence
  • Definition 7: Representativeness of a Value System of a Society
  • Definition 8: Conciseness of Value System of a Society