Inference of Altruism and Intrinsic Rewards in Multi-Agent Systems

Victor Villin, Christos Dimitrakakis

TL;DR

This paper tackles the challenge of inferring altruism and intrinsic rewards in multi-agent systems by introducing altruism-structured rewards within multi-agent inverse reinforcement learning (MAIRL). It shows that observing agents across multiple interaction groups can resolve reward identifiability, and presents two Bayesian methods, DRP and PORP, that infer both intrinsic rewards and altruism levels without relying on strict rationality assumptions. The approach is validated on challenging random Markov games and a collaborative cooking task, demonstrating reliable disentanglement of motives and the ability to synthesize behaviours at any desired altruism level. The work advances interpretability, trustworthiness, and social alignment of autonomous agents operating in human-centric environments, with practical implications for team management and adaptive human-AI collaboration.

Abstract

Human interactions are influenced by emotions, temperament, and affection, often conflicting with individuals' underlying preferences. Without explicit knowledge of those preferences, judging whether behaviour is appropriate becomes guesswork, leaving us highly prone to misinterpretation. Yet, such understanding is critical if autonomous agents are to collaborate effectively with humans. We frame the problem with multi-agent inverse reinforcement learning and show that even a simple model, where agents weigh their own welfare against that of others, can cover a wide range of social behaviours. Using novel Bayesian techniques, we find that intrinsic rewards and altruistic tendencies can be reliably identified by placing agents in different groups. Crucially, this disentanglement of intrinsic motivation from altruism enables the synthesis of new behaviours aligned with any desired level of altruism, even when demonstrations are drawn from restricted behaviour profiles.

Paper Structure

This paper contains 48 sections, 4 theorems, 39 equations, 8 figures, 7 tables, 2 algorithms.

Key Result

Proposition 1

Assume we observe a quantal response equilibrium (QRE) ${\boldsymbol{\pi}}^*$ for the game $\mathcal{G}(\mathbf{R})$, and that the agents' altruism levels are known. Then intrinsic rewards are identifiable up to potential shaping transformations $\tilde{r}_i(s,a) = r_i(s,a) + \delta{r_i}(s,a)$, where the shaping term $\delta{r_i}$ is induced by a function $\phi: \mathcal{S} \rightarrow \mathbb{R}$, which may be any potential shaping function.
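
The exact expression for $\delta{r_i}$ is not reproduced here. As a point of reference only, and assuming the classical single-agent form of potential-based shaping carries over to this setting, the shaping term would read as follows. The discount factor $\gamma$ and the transition kernel $P$ are symbols assumed for this sketch and are not taken from the statement above.

$$
\delta{r_i}(s,a) \;=\; \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[\phi(s')\big] \;-\; \phi(s)
$$

Under this form, the shaping terms telescope along any trajectory, so the shaped and unshaped rewards differ only by a quantity that depends on the starting state. This is the usual reason such transformations leave the induced behaviour unchanged and therefore cannot be ruled out from demonstrations alone.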

Figures (8)

  • Figure 1: The altruism scale model. We highlight three key values of $\lambda$. $\;\mathbf{-1}\;$: the agent weighs harming others as heavily as its own welfare. $\;\mathbf{0}\;$: the agent ignores others' welfare. $\;\mathbf{1}\;$: the agent values its own and others' welfare equally.
  • Figure 2: Graphical model of policy and altruism-structured rewards across groups. Latent nodes are white. Priors $\mathbb{P}(r)$ and $\mathbb{P}(\lambda)$ generate agent rewards $r_i$ and altruism $\lambda_i$, which determine group rewards $\mathbf{R}_\mathbf{g}$ and policies ${\boldsymbol{\pi}}_\mathbf{g}$, producing demonstrations $\mathcal{D}_\mathbf{g}$. Priors $\mathbb{P}(\pi)$ and $\mathbb{P}(\beta)$ also participate in generating rewards and policies. Coloured edges indicate how and which prior is used (red is for DRP, blue for PORP).
  • Figure 3: Sample efficiency of the proposed Bayesian inference methods on 4-player randomised games, using different numbers of groups. Error bars are provided in Appendix \ref{apx:results}.
  • Figure 4: Synthesis of policies from anti-social demonstrations in Overcooked, ranging from adversarial to altruistic.
  • Figure 5: Robustness of methods over varying optimality of demonstrations, and stochasticity belief.
  • ...and 3 more figures
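
The altruism scale in the Figure 1 caption can be illustrated with a small sketch. It assumes one plausible linear mixing rule, in which each agent adds $\lambda_i$ times the mean intrinsic reward of the other agents to its own intrinsic reward. This rule, the function name `altruism_structured_rewards`, and the sample numbers are all hypothetical and chosen only to match the three highlighted values of $\lambda$; the paper's actual definition may differ.

```python
from typing import List, Sequence


def altruism_structured_rewards(
    intrinsic: Sequence[float], altruism: Sequence[float]
) -> List[float]:
    """Mix intrinsic rewards with others' welfare (hypothetical linear rule).

    Assumed rule: R_i = r_i + lambda_i * mean_{j != i} r_j.
    This is an illustration, not the paper's confirmed definition.
    """
    n = len(intrinsic)
    if n < 2 or len(altruism) != n:
        raise ValueError("need at least two agents and one lambda per agent")
    total = sum(intrinsic)
    mixed = []
    for r_i, lam_i in zip(intrinsic, altruism):
        # Mean intrinsic reward of every agent except agent i.
        others_mean = (total - r_i) / (n - 1)
        mixed.append(r_i + lam_i * others_mean)
    return mixed


# Made-up intrinsic rewards for three agents at a single state-action pair.
intrinsic = [1.0, 2.0, 3.0]
for lam in (-1.0, 0.0, 1.0):
    print(lam, altruism_structured_rewards(intrinsic, [lam] * 3))
```

With $\lambda = 0$ the mixed reward equals the intrinsic reward, with $\lambda = 1$ own and others' welfare enter with equal weight, and with $\lambda = -1$ others' welfare is subtracted with the same weight as own welfare is added.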

Theorems & Definitions (6)

  • Proposition 1
  • Corollary 1
  • Corollary 2
  • Theorem 1
  • Definition 1: QRE Imitation Gap (QIG)
  • Definition 2: Policy Stability Gap (PSG)