Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment

Weichao Zhou, Wenchao Li

TL;DR

A novel framework for IRL-based IL that prioritizes task alignment over conventional data alignment: it uses expert demonstrations to derive a set of candidate reward functions, then adversarially trains a policy against this set to gain a collective validation of the policy's ability to accomplish the task.

Abstract

Many imitation learning (IL) algorithms use inverse reinforcement learning (IRL) to infer a reward function that aligns with the demonstration. However, the inferred reward functions often fail to capture the underlying task objectives. In this paper, we propose a novel framework for IRL-based IL that prioritizes task alignment over conventional data alignment. Our framework is a semi-supervised approach that leverages expert demonstrations as weak supervision to derive a set of candidate reward functions that align with the task rather than only with the data. It then adopts an adversarial mechanism to train a policy with this set of reward functions to gain a collective validation of the policy's ability to accomplish the task. We provide theoretical insights into this framework's ability to mitigate task-reward misalignment and present a practical implementation. Our experimental results show that our framework outperforms conventional IL baselines in complex and transfer learning scenarios.
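The two steps described above (derive a set of candidate reward functions, then train a policy adversarially against that set) can be illustrated with a small tabular sketch. This is not the paper's implementation. It assumes a finite list of policies and reward functions with known utilities, takes the candidate set to be all rewards whose IRL loss is within a margin `delta` of the best loss, and takes the adversarial objective to be the standard minimax-regret criterion suggested by the $MinimaxRegret(R_{E,\delta})$ notation in the Figure 5 caption. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def candidate_rewards(irl_loss, delta):
    """Return indices of rewards whose IRL loss is within `delta` of the minimum.

    Assumption: this stands in for the candidate set R_{E,delta}; the exact
    definition used by the paper is not given in this excerpt.
    """
    irl_loss = np.asarray(irl_loss, dtype=float)
    return np.flatnonzero(irl_loss <= irl_loss.min() + delta)

def minimax_regret_policy(utility, reward_idx):
    """Pick the policy with the smallest worst-case regret over the candidates.

    utility[i, j] is the utility U_{r_i}(pi_j) of policy j under reward i.
    Regret of pi_j under r_i is max_k U_{r_i}(pi_k) - U_{r_i}(pi_j).
    """
    U = np.asarray(utility, dtype=float)[reward_idx]
    regret = U.max(axis=1, keepdims=True) - U   # shape: (rewards, policies)
    worst_case = regret.max(axis=0)             # adversary picks the worst reward
    return int(worst_case.argmin()), worst_case

# Hypothetical toy numbers: 3 reward functions (rows) x 3 policies (columns).
utility = [
    [1.0, 0.8, 0.1],  # r_0: lowest IRL loss, prefers policy 0
    [0.2, 0.7, 0.9],  # r_1: nearly as good a fit, prefers policy 2
    [0.9, 0.1, 0.0],  # r_2: poor fit to the demonstrations
]
irl_loss = [0.10, 0.15, 0.60]

idx = candidate_rewards(irl_loss, delta=0.1)
best, worst_case = minimax_regret_policy(utility, idx)
print("candidate rewards:", idx.tolist())
print("worst-case regret per policy:", np.round(worst_case, 3).tolist())
print("selected policy:", best)
```

In this toy setting, optimizing only the best-fitting reward `r_0` would select policy 0, which does poorly under the nearly-as-plausible `r_1`. The minimax-regret choice is policy 1, which performs reasonably under every candidate reward.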

Paper Structure

This paper contains 34 sections, 30 theorems, 33 equations, 8 figures, 3 tables, 3 algorithms.

Key Result

Proposition 1

Given the policy order $\preceq_{task}$ of a task, for any two reward functions $r_1, r_2$, if $\{\pi\ |\ U_{r_1}(\pi)\geq \overline{U}_{r_1}\}\subseteq \{\pi\ |\ U_{r_2}(\pi)\geq \overline{U}_{r_2}\}$, then there must exist policies $\pi_1\in \{\pi\ |\ U_{r_1}(\pi)\geq \overline{U}_{r_1}\}, \pi_

Figures (8)

  • Figure 1: (a) The two bars respectively represent the policy utility spaces of a task-aligned reward function $r^+$ and a task-misaligned reward function $r^-$. White indicates the utilities of acceptable policies, and blue indicates those of unacceptable policies. Within the utility space of $r^+$, the utilities of all acceptable policies are higher ($\geq \underline{U}_{r^+}$) than those of the unacceptable ones, and policies with utilities higher than $\overline{U}_{r^+}$ rank higher in the policy order than those with utilities lower than $\overline{U}_{r^+}$. Within the utility space of $r^-$, the utilities of acceptable and unacceptable policies are mixed together, leading to a low $\underline{U}_{r^-}$ and an even lower $\overline{U}_{r^-}$. (b) IRL-based IL relies solely on IRL's optimal reward function $r^*$, which can be task-misaligned and lead to an unacceptable policy $\pi_{r^*}\in \Pi\backslash\Pi_{acc}$, while PAGAR-based IL learns an acceptable policy $\pi^*\in\Pi_{acc}$ from a set $R_{E,\delta}$ of reward functions.
  • Figure 2: Comparing Algorithm 1 with baselines in partially observable navigation tasks. The suffix after each 'PAGAR-' indicates which IRL technique is used in Algorithm 1. The $y$ axis indicates the average return per episode. The $x$ axis indicates the number of time steps.
  • Figure 3: PAGAR-GAIL in different reward spaces
  • Figure 4: Comparing Algorithm 1 with f-IRL in continuous control tasks. 'PAGAR-fIRL' indicates f-IRL is used as the inverse RL algorithm in Algorithm 1. The $y$ axis indicates the average return per episode. The $x$ axis indicates the number of time steps in the environment.
  • Figure 5: Left: Consider an MDP where there are two available actions $a_1,a_2$ at initial state $s_0$. In other states, actions make no difference: the transition probabilities are either annotated at the transition edges or equal $1$ by default. States $s_3$ and $s_6$ are terminal states. Expert demonstrations are in $E$. Middle: The x-axis indicates the MaxEnt IRL loss bound $\delta$ for $R_{E,\delta}$ as defined in Section \ref{subsec:app_a_1}. The y-axis indicates the probability of the protagonist policy learned via $MinimaxRegret(R_{E,\delta})$ choosing $a_2$ at $s_0$. The red curve shows how different values of $\delta$ lead to different protagonist policies. The blue dashed curve is for reference, showing the optimal policy under the optimal reward learned via MaxEnt IRL. Right: The curve shows how the MaxEnt IRL loss changes with $\omega$.
  • ...and 3 more figures

Theorems & Definitions (66)

  • Definition 1: Task
  • Definition 2: Task-Aligned Reward Functions
  • Proposition 1
  • Theorem 1
  • Definition 3: Mitigation of Task-Reward Misalignment
  • Definition 4: Protagonist Antagonist Guided Adversarial Reward (PAGAR)
  • Theorem 2: Weak Acceptance
  • Theorem 3: Strong Acceptance
  • Proposition 2
  • Theorem 4
  • ...and 56 more