PAGAR: Taming Reward Misalignment in Inverse Reinforcement Learning-Based Imitation Learning with Protagonist Antagonist Guided Adversarial Reward

Weichao Zhou, Wenchao Li

TL;DR

Reward misalignment in IRL-based imitation learning can cause task failures when the inferred reward does not reflect the true objective. PAGAR introduces a semi-supervised reward design that optimizes a protagonist policy over a set of task-aligned rewards while competing against an antagonist under a minimax objective, effectively training under a mixture of rewards. The framework provides theoretical conditions for avoiding task failure and details an on-and-off policy algorithm that integrates IRL components, achieving superior performance and sample efficiency on challenging, partially observable, and transfer tasks. This approach enhances robustness to reward misspecification and offers practical pathways for deploying IRL-based IL in real-world settings where the task objective is unknown or noisy.

Abstract

Many imitation learning (IL) algorithms employ inverse reinforcement learning (IRL) to infer the intrinsic reward function that an expert is implicitly optimizing for based on their demonstrated behaviors. However, in practice, IRL-based IL can fail to accomplish the underlying task due to a misalignment between the inferred reward and the objective of the task. In this paper, we address the susceptibility of IL to such misalignment by introducing a semi-supervised reward design paradigm called Protagonist Antagonist Guided Adversarial Reward (PAGAR). PAGAR-based IL trains a policy to perform well under mixed reward functions instead of a single reward function as in IRL-based IL. We identify the theoretical conditions under which PAGAR-based IL can avoid the task failures caused by reward misalignment. We also present a practical on-and-off policy approach to implementing PAGAR-based IL. Experimental results show that our algorithm outperforms standard IL baselines in complex tasks and challenging transfer settings.
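
The contrast between optimizing for a single inferred reward and performing well across a set of candidate rewards can be illustrated with a toy minimax-regret computation. This is only a minimal sketch under stated assumptions, not the paper's algorithm: the utility matrix `U`, the policy names, and both helper functions are hypothetical, and the antagonist is restricted to the same finite set of policies as the protagonist.

```python
import numpy as np

# Hypothetical utilities U[r, pi]: rows are candidate rewards, columns are policies.
# Row 0 stands in for a misaligned reward r_minus, row 1 for an aligned reward r_plus.
policies = ["pi_minus", "pi_plus", "pi_other"]
U = np.array([
    [1.00, 0.90, 0.20],  # utilities under r_minus
    [0.10, 0.95, 0.40],  # utilities under r_plus
])

def single_reward_policy(U, r_index):
    """Index of the policy that is optimal under one chosen reward."""
    return int(np.argmax(U[r_index]))

def minimax_regret_policy(U):
    """Index of the policy minimizing its worst-case regret over all rewards.

    The regret of a policy under a reward is the gap to the best policy
    (the antagonist) under that same reward.
    """
    best_per_reward = U.max(axis=1, keepdims=True)  # antagonist utility per reward
    regret = best_per_reward - U                    # shape: (rewards, policies)
    worst_case = regret.max(axis=0)                 # adversarial choice of reward
    return int(np.argmin(worst_case)), worst_case

single_choice = single_reward_policy(U, r_index=0)
protagonist_choice, worst_case = minimax_regret_policy(U)

print("optimal under r_minus only:", policies[single_choice])
print("minimax-regret choice:", policies[protagonist_choice])
print("worst-case regret per policy:", dict(zip(policies, worst_case.round(2))))
```

In this toy setting the policy that is optimal under the misaligned reward alone scores poorly under the aligned one, while the minimax-regret choice stays close to the best achievable utility under both rewards.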

Paper Structure

This paper contains 33 sections, 19 theorems, 33 equations, 8 figures, 2 tables, 3 algorithms.

Key Result

Lemma 1

The optimal solution $r^*$ of IRL is misaligned with the task specified by $\Phi$ iff $\Phi(\pi_{r^*})\equiv false$.

Figures (8)

  • Figure 1: Of the two reward functions, $r^-$ is misaligned with the task and $r^+$ is aligned with it. The vertical axis measures the ranges of $U_{r^+}(\pi)$ and $U_{r^-}(\pi)$ over all $\pi\in\Pi$. $\mathbb{U}_{r^+}$ is the interval $[\underset{\pi\in\Pi}{\min}\ U_{r^+}(\pi), \underset{\pi\in\Pi}{\max}\ U_{r^+}(\pi)]$, where $S_{r^+}$ and $F_{r^+}$ are two disjoint intervals such that any policy achieving utility higher than $\inf S_{r^+}$ succeeds in the task and any policy achieving utility lower than $\sup F_{r^+}$ fails. $\mathbb{U}_{r^-}$ is the interval $[\underset{\pi\in\Pi}{\min}\ U_{r^-}(\pi), \underset{\pi\in\Pi}{\max}\ U_{r^-}(\pi)]$, for which no such $S$ and $F$ intervals exist. The reward function $r^-$ is the optimal solution for IRL since $E$ maximally outperforms any other policy under $r^-$. IRL-based IL learns the optimal policy $\pi^-$ under $r^-$, which, however, has a low utility $U_{r^+}(\pi^-)\leq \sup F_{r^+}$ under $r^+$ and therefore fails the task. In contrast, $\pi^+$ performs consistently well under both $r^+$ and $r^-$, as $E$ does.
  • Figure 2: Left: Consider an MDP with two available actions $a_1,a_2$ at the initial state $s_0$. In other states, actions make no difference: the transition probabilities are either annotated on the transition edges or equal $1$ by default. States $s_3$ and $s_6$ are terminal states. Expert demonstrations are in $E$. Right: The x-axis indicates the MaxEnt IRL loss bound $\delta$ for $R_{E,\delta}$ as defined in Section \ref{subsec:pagar1_1}. The y-axis indicates the probability that the protagonist policy learned via $MinimaxRegret(R_{E,\delta})$ chooses $a_2$ at $s_0$. The red curve shows how different values of $\delta$ lead to different protagonist policies. The blue dashed curve is for reference, showing the optimal policy under the optimal reward learned by MaxEnt IRL.
  • Figure 3: Comparing Algorithm \ref{alg:pagar2_1} with baselines in partially observable navigation tasks. The suffix after each 'PAGAR-' indicates which IRL technique is used in Algorithm 1. The $y$ axis indicates the average return per episode. The $x$ axis indicates the number of time steps.
  • Figure 4: Different reward spaces
  • Figure 5: (Left: Walker2d-v2. Right: HalfCheetah-v2) The $y$ axis indicates the average return per episode.
  • ...and 3 more figures

Theorems & Definitions (47)

  • Definition 1: Task-Reward Alignment
  • Lemma 1
  • Definition 2: Mitigate Reward Misalignment
  • Definition 3: Protagonist Antagonist Guided Adversarial Reward (PAGAR)
  • Example 1
  • Theorem 1: Task-Failure Avoidance
  • Theorem 2: Task-Success Guarantee
  • Corollary 1
  • Corollary 2
  • Proposition 1
  • ...and 37 more