Inference of Altruism and Intrinsic Rewards in Multi-Agent Systems
Victor Villin, Christos Dimitrakakis
TL;DR
This paper tackles the challenge of inferring altruism and intrinsic rewards in multi-agent systems by introducing altruism-structured rewards within multi-agent inverse reinforcement learning (MAIRL). It shows that observing agents across multiple interaction groups can resolve the reward identifiability problem, and presents two Bayesian methods, DRP and PORP, that infer both intrinsic rewards and altruism levels without relying on strict rationality assumptions. The approach is validated on challenging random Markov games and a collaborative cooking task, where it reliably disentangles the two motives and synthesizes behaviours at any desired altruism level. The work aims to improve the interpretability, trustworthiness, and social alignment of autonomous agents operating in human-centric environments, with practical implications for team management and adaptive human-AI collaboration.
Abstract
Human interactions are influenced by emotions, temperament, and affection, often conflicting with individuals' underlying preferences. Without explicit knowledge of those preferences, judging whether behaviour is appropriate becomes guesswork, leaving us highly prone to misinterpretation. Yet, such understanding is critical if autonomous agents are to collaborate effectively with humans. We frame the problem with multi-agent inverse reinforcement learning and show that even a simple model, where agents weigh their own welfare against that of others, can cover a wide range of social behaviours. Using novel Bayesian techniques, we find that intrinsic rewards and altruistic tendencies can be reliably identified by placing agents in different groups. Crucially, this disentanglement of intrinsic motivation from altruism enables the synthesis of new behaviours aligned with any desired level of altruism, even when demonstrations are drawn from restricted behaviour profiles.
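As a rough illustration of the "simple model" mentioned above, the sketch below assumes one possible form of an altruism-weighted reward: each agent's effective reward is a convex combination of its own intrinsic reward and the mean intrinsic reward of the other agents, weighted by an altruism level in [0, 1]. This parameterisation, the function name `altruistic_rewards`, and the example numbers are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def altruistic_rewards(intrinsic, altruism):
    """Blend each agent's intrinsic reward with the mean reward of the others.

    Hypothetical parameterisation (an assumption, not the paper's definition):
        effective_i = (1 - altruism_i) * intrinsic_i
                      + altruism_i * mean_{j != i} intrinsic_j

    intrinsic: per-agent intrinsic rewards for one joint step, shape (n,)
    altruism:  per-agent altruism levels in [0, 1], shape (n,)
    """
    intrinsic = np.asarray(intrinsic, dtype=float)
    altruism = np.asarray(altruism, dtype=float)
    n = intrinsic.shape[0]
    if n < 2:
        raise ValueError("need at least two agents")
    if altruism.shape != intrinsic.shape:
        raise ValueError("intrinsic and altruism must have the same shape")
    if np.any(altruism < 0.0) or np.any(altruism > 1.0):
        raise ValueError("altruism levels must lie in [0, 1]")

    # Mean intrinsic reward of everyone except agent i.
    others_mean = (intrinsic.sum() - intrinsic) / (n - 1)
    return (1.0 - altruism) * intrinsic + altruism * others_mean


# Agent 0 is fully selfish, agent 1 is fully altruistic.
effective = altruistic_rewards([1.0, 0.0], [0.0, 1.0])
```

Under this assumed form, an altruism level of 0 recovers purely self-interested behaviour and a level of 1 makes an agent care only about the others. Changing the altruism vector while keeping the intrinsic rewards fixed is one way to picture how behaviour at a different altruism level could be synthesized once the two quantities have been disentangled.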
