Multi-Principal Assistance Games

Arnaud Fickinger, Simon Zhuang, Dylan Hadfield-Menell, Stuart Russell

TL;DR

The paper addresses value alignment in settings with multiple human principals, where diverse payoffs and strategic behavior impede learning. It analyzes impossibility results for learning from multiple humans via MPAL and proposes mechanism-design-inspired alternatives, including MPBA and voting-by-demonstrating, to align actions with social welfare. Through theoretical results and simulations, it shows manipulation risks under standard IRL and demonstrates how carefully designed shared-control mechanisms can mitigate manipulation while delivering near-optimal social outcomes. The work provides a principled framework for integrating preference inference with social welfare optimization in multi-human AI systems, with implications for safer, more cooperative AI in real-world, multi-user environments.

Abstract

Assistance games (also known as cooperative inverse reinforcement learning games) have been proposed as a model for beneficial AI, wherein a robotic agent must act on behalf of a human principal but is initially uncertain about the human's payoff function. This paper studies multi-principal assistance games, which cover the more general case in which the robot acts on behalf of N humans who may have widely differing payoffs. Impossibility theorems in social choice theory and voting theory can be applied to such games, suggesting that strategic behavior by the human principals may complicate the robot's task of learning their payoffs. We analyze in particular a bandit apprentice game in which the humans act first to demonstrate their individual preferences for the arms and then the robot acts to maximize the sum of human payoffs. We explore the extent to which the cost of choosing suboptimal arms reduces the incentive to mislead, a form of natural mechanism design. In this context, we propose a social choice method that uses shared control of a system to combine preference inference with social welfare optimization.
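The trade-off between the cost of a misleading demonstration and its influence on the robot can be illustrated with a toy calculation. The sketch below is not the paper's model. It assumes a hypothetical robot that treats each demonstrated arm as a vote, pulls the plurality arm for `horizon` rounds, and breaks ties uniformly at random; the payoffs, the other humans' demonstrations, and the function names are all invented for illustration.

```python
from collections import Counter

def robot_arms(demos):
    """Arms with the most demonstrations (assumed plurality inference)."""
    counts = Counter(demos)
    top = max(counts.values())
    return sorted(arm for arm, c in counts.items() if c == top)

def total_payoff(payoffs, my_demo, other_demos, horizon):
    """Payoff from one's own demonstration plus the expected payoff from
    `horizon` robot pulls. Ties are assumed to be broken uniformly at random."""
    tied = robot_arms(other_demos + [my_demo])
    robot_value = sum(payoffs[arm] for arm in tied) / len(tied)
    return payoffs[my_demo] + horizon * robot_value

def best_demo(payoffs, other_demos, horizon):
    """Arm this human should demonstrate to maximize their own total payoff."""
    return max(range(len(payoffs)),
               key=lambda arm: total_payoff(payoffs, arm, other_demos, horizon))

# Hypothetical numbers: this human truly prefers arm 0, then arm 1, then arm 2.
payoffs = [1.0, 0.5, 0.0]
others = [1, 1, 2, 2]  # demonstrations of the four other humans

print(best_demo(payoffs, others, horizon=1))  # honest demonstration of arm 0
print(best_demo(payoffs, others, horizon=4))  # strategic demonstration of arm 1
```

With a short horizon, giving up the payoff of arm 0 during the demonstration is not worth the extra influence, so the honest demonstration is the best response. With a longer horizon the robot's choice dominates, and demonstrating arm 1 to break the tie becomes profitable.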

Paper Structure

This paper contains 16 sections, 10 theorems, 9 equations, 1 figure.

Key Result

Theorem 1

If $IRL$ returns a distribution over reward functions, then the mechanism defined by $\mathcal{M}(\xi^1,\ldots,\xi^N) = RL \circ \mathbb{E}[W(IRL(\xi^1),\ldots,IRL(\xi^N))]$ maximizes the expected value function over the induced distribution of MDPs.
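To make the composition in Theorem 1 concrete, here is a minimal sketch for the simplest possible case, a one-shot bandit, where $RL$ reduces to picking the arm with the highest reward. It assumes that $W$ is the utilitarian sum of the humans' rewards and that the $IRL$ step has already produced a discrete posterior for each human; the posteriors and function names below are hypothetical, not taken from the paper.

```python
from itertools import product

# Assumed IRL output for a 3-arm bandit: for each human, a list of
# (probability, reward_vector) pairs, i.e. a discrete posterior over rewards.
posterior_h1 = [(0.7, (1.0, 0.0, 0.5)), (0.3, (0.0, 1.0, 0.5))]
posterior_h2 = [(0.5, (0.0, 1.0, 0.6)), (0.5, (0.0, 0.0, 1.0))]

def welfare(rewards):
    """Assumed utilitarian W: sum the humans' reward vectors arm by arm."""
    return [sum(per_arm) for per_arm in zip(*rewards)]

def expected_welfare(posteriors):
    """E[W(R^1, ..., R^N)], treating the humans' posteriors as independent."""
    n_arms = len(posteriors[0][0][1])
    total = [0.0] * n_arms
    for profile in product(*posteriors):  # one reward hypothesis per human
        prob = 1.0
        for p, _ in profile:
            prob *= p
        w = welfare([r for _, r in profile])
        for arm in range(n_arms):
            total[arm] += prob * w[arm]
    return total

def rl(reward):
    """'RL' in a one-shot bandit: pull the arm with the highest reward."""
    return max(range(len(reward)), key=lambda arm: reward[arm])

def mechanism(posteriors):
    """M = RL applied to the expected aggregated reward."""
    return rl(expected_welfare(posteriors))

print(expected_welfare([posterior_h1, posterior_h2]))
print(mechanism([posterior_h1, posterior_h2]))  # index of the chosen arm
```

In this bandit case the expected value of pulling an arm equals that arm's expected aggregated reward, so acting optimally on the expected reward also maximizes expected value over the distribution of reward profiles, which is the property the theorem states for MDPs in general.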

Figures (1)

  • Figure 1: Manipulating a Multi-Agent Alignment IRL Method using a QP in a 2D $5 \times 6$ Gridworld Environment with a 3D feature space. First row: True reward of humans 1 and 2; State visitation count of optimal (resp. best-response) trajectories of human 2 (the initial state is in the bottom left-hand corner). Second row: Recovered rewards using IRL on the aggregate of first human's optimal and second human's optimal (resp. best-response) trajectories; Optimal robot trajectories in the MDP induced by these rewards.

Theorems & Definitions (15)

  • Example 1: MPAL via IRL
  • Theorem 1
  • Example 2: Voting
  • Definition 1: Straightforward Mechanism
  • Theorem 2: Based on Gibbard 1973
  • Theorem 3: Based on Gibbard 1978
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • ...and 5 more