PAGAR: Taming Reward Misalignment in Inverse Reinforcement Learning-Based Imitation Learning with Protagonist Antagonist Guided Adversarial Reward
Weichao Zhou, Wenchao Li
TL;DR
In IRL-based imitation learning, the inferred reward may not reflect the true task objective, and a policy trained on it can then fail the task. PAGAR is a semi-supervised reward design paradigm: a protagonist policy is optimized over a set of task-aligned rewards while competing against an antagonist policy under a minimax objective, so it is effectively trained under a mixture of rewards rather than a single inferred reward. The paper gives theoretical conditions under which this avoids task failure and describes an on-and-off policy algorithm that integrates IRL components. In experiments, the algorithm outperforms standard IL baselines in performance and sample efficiency on complex, partially observable, and transfer tasks. The approach makes IRL-based IL more robust to reward misspecification in settings where the task objective is unknown or noisy.
Abstract
Many imitation learning (IL) algorithms employ inverse reinforcement learning (IRL) to infer the intrinsic reward function that an expert is implicitly optimizing for based on their demonstrated behaviors. However, in practice, IRL-based IL can fail to accomplish the underlying task due to a misalignment between the inferred reward and the objective of the task. In this paper, we address the susceptibility of IL to such misalignment by introducing a semi-supervised reward design paradigm called Protagonist Antagonist Guided Adversarial Reward (PAGAR). PAGAR-based IL trains a policy to perform well under mixed reward functions instead of a single reward function as in IRL-based IL. We identify the theoretical conditions under which PAGAR-based IL can avoid the task failures caused by reward misalignment. We also present a practical on-and-off policy approach to implementing PAGAR-based IL. Experimental results show that our algorithm outperforms standard IL baselines in complex tasks and challenging transfer settings.
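To make the idea of performing well under several candidate rewards concrete, here is a toy sketch in Python. It is not the authors' algorithm. It assumes, as one reading of the minimax objective, that the protagonist minimizes its worst-case regret over a finite reward set, where the regret under a reward is the gap to the best (antagonist) policy for that reward. The policy names, reward names, utility values, and all function names are hypothetical.

```python
from typing import Dict, List

# utilities[reward][policy] = expected return of `policy` under `reward`.
# All numbers below are made up for illustration.
Utilities = Dict[str, Dict[str, float]]


def regret(utilities: Utilities, reward: str, policy: str) -> float:
    """Gap between the best policy for `reward` (the antagonist) and `policy`."""
    best_response = max(utilities[reward].values())
    return best_response - utilities[reward][policy]


def worst_case_regret(utilities: Utilities, policy: str) -> float:
    """Largest regret of `policy` over every candidate reward."""
    return max(regret(utilities, reward, policy) for reward in utilities)


def minimax_regret_policy(utilities: Utilities, policies: List[str]) -> str:
    """Protagonist choice: the policy whose worst-case regret is smallest."""
    return min(policies, key=lambda p: worst_case_regret(utilities, p))


def single_reward_policy(utilities: Utilities, reward: str, policies: List[str]) -> str:
    """Baseline: the policy that is optimal for one inferred reward only."""
    return max(policies, key=lambda p: utilities[reward][p])


# Two candidate rewards that both explain the demonstrations (assumed).
EXAMPLE_UTILITIES: Utilities = {
    "r1": {"a": 10.0, "b": 7.0, "c": 0.0},
    "r2": {"a": 0.0, "b": 6.0, "c": 8.0},
}
POLICIES = ["a", "b", "c"]

if __name__ == "__main__":
    print(single_reward_policy(EXAMPLE_UTILITIES, "r1", POLICIES))
    print(minimax_regret_policy(EXAMPLE_UTILITIES, POLICIES))
```

In this toy setting, a policy tuned to `r1` alone does very poorly if `r2` is closer to the true objective, while the minimax-regret choice gives up a little under each reward to stay acceptable under both.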
