Policy Aggregation

Parand A. Alamdari, Soroush Ebadian, Ariel D. Procaccia

TL;DR

The key insight is that social choice methods can be reinterpreted by identifying ordinal preferences with volumes of subsets of the state-action occupancy polytope. Building on this, a variety of methods, including approval voting, Borda count, the proportional veto core, and quantile fairness, can be practically applied to policy aggregation.

Abstract

We consider the challenge of AI value alignment with multiple individuals that have different reward functions and optimal policies in an underlying Markov decision process. We formalize this problem as one of policy aggregation, where the goal is to identify a desirable collective policy. We argue that an approach informed by social choice theory is especially suitable. Our key insight is that social choice methods can be reinterpreted by identifying ordinal preferences with volumes of subsets of the state-action occupancy polytope. Building on this insight, we demonstrate that a variety of methods--including approval voting, Borda count, the proportional veto core, and quantile fairness--can be practically applied to policy aggregation.
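To make the volume-based reading of ordinal preferences concrete, here is a minimal sketch under strong simplifying assumptions that are not taken from the paper: a single-state MDP, so that the state-action occupancy polytope is just the probability simplex over actions, and a Monte Carlo estimate in place of exact volume computation. The helper names (`sample_simplex`, `volume_quantile`) and the three one-hot reward vectors are hypothetical and purely illustrative.

```python
import random

def sample_simplex(k, rng):
    """Draw a point uniformly from the probability simplex over k actions."""
    # Normalized i.i.d. Exponential(1) draws follow Dirichlet(1, ..., 1),
    # which is the uniform distribution on the simplex.
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def expected_return(occupancy, reward):
    """Expected return is linear in the occupancy measure."""
    return sum(d * r for d, r in zip(occupancy, reward))

def volume_quantile(policy, reward, samples):
    """Estimate the fraction of the polytope's volume on which this agent's
    return is at most the return of `policy` (assumed illustration of an
    ordinal, volume-based ranking)."""
    target = expected_return(policy, reward)
    below = sum(1 for s in samples if expected_return(s, reward) <= target)
    return below / len(samples)

rng = random.Random(0)
# Hypothetical agents: each one only values a single action.
rewards = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
samples = [sample_simplex(3, rng) for _ in range(20000)]

uniform_policy = [1 / 3, 1 / 3, 1 / 3]
favourite_of_agent_0 = [1.0, 0.0, 0.0]

uniform_quantiles = [volume_quantile(uniform_policy, r, samples) for r in rewards]
skewed_quantiles = [volume_quantile(favourite_of_agent_0, r, samples) for r in rewards]
print(uniform_quantiles)
print(skewed_quantiles)
```

In this toy setting the uniform policy places every agent at roughly the same quantile of their own return distribution, whereas agent 0's favourite policy places agent 0 at the top and the other two agents at the bottom. Because the ranking depends only on volumes, it does not change if an agent's reward function is rescaled by a positive constant.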

Paper Structure

This paper contains 16 sections, 8 theorems, 14 equations, 4 figures, 1 table, 4 algorithms.

Key Result

Theorem 1

Let $\epsilon \in (0, 1/n)$. For a policy aggregation problem, the $\epsilon$-proportional veto core is nonempty. Furthermore, such policies can be found in polynomial time using $O(\log(1/\epsilon))$ many calls per agent to $\mathrm{vol\text{-}comp}$.

Figures (4)

  • Figure 1: Comparison of policies optimized by different rules in two different scenarios based on the normalized expected return for agents. The bars, grouped by rule, correspond to agents sorted based on their normalized expected return. The error bars show the standard error of the mean.
  • Figure 2: The effective state-action occupancy polytope of agents and their expected return distribution.
  • Figure: Seq. $\epsilon$-Prop. Veto Core [CMYL+24]
  • Figure: $\alpha$-Approvals MILP

Theorems & Definitions (20)

  • Definition 1: state-action occupancy measure
  • Definition 2: state-action occupancy polytope [puterman2014markov, zahavy2021reward]
  • Definition 3: Pareto optimality
  • Definition 4: expected return distribution
  • Definition 5: proportional veto core
  • Theorem 1
  • Definition 6: $q$-quantile fairness
  • Lemma 1: Grünbaum's Inequality
  • Theorem 2
  • proof
  • ...and 10 more