Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm

Yang Chen, Menglin Zou, Jiaqi Zhang, Yitan Zhang, Junyi Yang, Gael Gendron, Libo Zhang, Jiamou Liu, Michael J. Witbrock

TL;DR

The paper tackles instability in IRL by introducing TRRO, a non-adversarial MM-based framework that guarantees monotonic improvement in expert imitation, and deriving PIRO as a practical algorithm with adaptive reward updates. It unifies non-adversarial IRL under likelihood maximization and provides a theoretical stability guarantee analogous to TRPO for forward RL. Empirically, PIRO achieves strong reward recovery, robust policy imitation, high sample efficiency, and effective reward transfer on MuJoCo, Gym-Robotics, and a real-world meerkat dataset, while maintaining stability across tasks. The work bridges theory and practice in IRL, offering a scalable, stable alternative to adversarial methods with broad applicability to robust reward learning and transfer under dynamics shifts.

Abstract

Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which frequently leads to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning reward and policy via energy-based formulations but lack formal guarantees. This work bridges this gap. We first present a unified view showing that canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO into Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery and policy imitation, with high sample efficiency, on MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.
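
The Minorization-Maximization idea mentioned in the abstract can be illustrated on a toy problem. The sketch below is not the paper's TRRO surrogate or the PIRO algorithm; it is a generic MM loop on a one-parameter Bernoulli log-likelihood, chosen only because a valid minorizer is easy to write down. The function names (`log_likelihood`, `mm_step`, `run_mm`) and the data are hypothetical. Each step maximizes a quadratic lower bound that touches the objective at the current iterate, so the objective cannot decrease.

```python
import math


def log_likelihood(theta, xs):
    """Toy concave objective: log-likelihood of binary data xs under a
    Bernoulli model with success probability sigmoid(theta)."""
    return sum(x * theta - math.log1p(math.exp(theta)) for x in xs)


def gradient(theta, xs):
    """Derivative of log_likelihood with respect to theta."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return sum(x - p for x in xs)


def mm_step(theta, xs):
    """One Minorization-Maximization step.

    The second derivative is -n * p * (1 - p), which is bounded below by
    -n / 4. Hence the quadratic
        g(t | theta) = l(theta) + l'(theta) * (t - theta)
                       - (n / 8) * (t - theta) ** 2
    satisfies g(t | theta) <= l(t) for all t, with equality at t = theta.
    Its maximizer is returned below.
    """
    curvature = len(xs) / 4.0
    return theta + gradient(theta, xs) / curvature


def run_mm(xs, theta0=0.0, steps=20):
    """Iterate mm_step and return the whole sequence of iterates."""
    thetas = [theta0]
    for _ in range(steps):
        thetas.append(mm_step(thetas[-1], xs))
    return thetas


# Hypothetical data: 6 successes out of 8 trials.
xs = [1, 1, 1, 0, 1, 0, 1, 1]
thetas = run_mm(xs, theta0=-2.0)
values = [log_likelihood(t, xs) for t in thetas]

# MM guarantee: l(theta_next) >= g(theta_next | theta) >= g(theta | theta) = l(theta).
is_monotone = all(b >= a - 1e-12 for a, b in zip(values, values[1:]))
```

In this toy case the iterates approach the maximum-likelihood estimate, the logit of the empirical success rate, and `is_monotone` checks the step-by-step improvement that the MM construction implies. Where the toy example has a closed-form curvature bound, TRRO is described as using a surrogate that identifies a trusted reward update at each step, under the assumption of exact policy optimization.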

Paper Structure

This paper contains 30 sections, 11 theorems, 82 equations, 15 figures, 5 tables, 3 algorithms.

Key Result

Proposition 1

The log-likelihood objective $\ell({\bm\theta})$ in (eq:ML-IRL) has the following equivalent form that implies the expression of its gradient:

Figures (15)

  • Figure 1: Theoretical (top) and practical (bottom) contributions. Top: PPO -- rooted in TRPO's theory of monotonic policy improvement -- has been one of the most successful RL algorithms. This work is motivated by a dualism: the mathematical beauty of TRPO should not exist in isolation, but in conjugation with its inverse problem space. We identify and formalize this inverse counterpart, completing the "right half" of this "symmetric picture". We believe this contribution advances RL theory and opens new avenues for designing robust IRL algorithms. See Sec. \ref{sec:TRRO} for theoretical justifications. Bottom: PIRO, our practical algorithm, achieves a three-way balance among learning stability, imitation performance, and sample efficiency. To our knowledge, PIRO is the first IRL method that achieves state-of-the-art imitation performance and learning stability with high sample efficiency. See Sec. \ref{sec:PRO} for the practical algorithm design and Sec. \ref{sec:experiments} for experiments.
  • Figure 2: Comparing Adversarial IRL, Non-adversarial IRL and our Trust Region Reward Optimization (TRRO). (a) Adversarial IRL methods frame reward learning as a game against a (nearly) best-response policy, often resulting in unstable training dynamics due to the inherent minimax structure. (b) Non-adversarial IRL methods bypass this game setup by coupling reward and policy via energy-based formulations and jointly updating them by minimizing the expected return gap (a.k.a. the imitation gap). However, the lack of principled control over the reward update makes them sensitive to optimization errors. (c) TRRO reformulates non-adversarial IRL as a majorization-minimization (MM) process that identifies a trusted reward update in each step. This ensures a monotonic reduction in the imitation gap and provides, to our knowledge, the first formal stability guarantee in IRL. (Note: This is a theoretical comparison assuming exact policy computation.)
  • Figure 3: Illustration of the mechanism of Trust Region Reward Optimization (TRRO). The reward optimization follows a Minorization-Maximization process, iteratively optimizing a surrogate function that minorizes the original likelihood objective, thereby guaranteeing monotonic improvement in the likelihood of expert demonstrations (assuming exact policy optimization).
  • Figure 4: Reward curves of algorithms on MuJoCo locomotion tasks and Gym Robotics tasks.
  • Figure 5: Experiments on reward recovery in tasks with state-only rewards. Left: The task is a $7\times7$ grid world, where the agent starts from a random initial position (blue circles) with the objective of reaching the target position (red star) via the shortest possible path. Right: The ground-truth reward at each position is defined as the negative Euclidean distance to the terminal state. Middle: The reward recovered by PIRO is highly consistent with the ground-truth reward. Cumulative rewards: $-9.24$ (expert) vs. $-8.48$ (PIRO).
  • ...and 10 more figures
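
The ground-truth reward described in the Figure 5 caption (negative Euclidean distance to the terminal state on a $7\times7$ grid) can be written down directly. The following is a minimal sketch under stated assumptions: the caption does not give the target cell or the coordinate convention, so `TARGET = (6, 6)` and the (row, column) indexing are placeholders, and the names `ground_truth_reward` and `reward_grid` are hypothetical.

```python
import math

GRID_SIZE = 7
# Assumption: the caption does not state where the target (red star) is;
# the bottom-right corner is used here purely for illustration.
TARGET = (6, 6)


def ground_truth_reward(position, target=TARGET):
    """State-only reward: negative Euclidean distance from a grid cell
    (row, column) to the target cell."""
    return -math.dist(position, target)


# Reward table indexed as reward_grid[row][column].
reward_grid = [
    [ground_truth_reward((row, col)) for col in range(GRID_SIZE)]
    for row in range(GRID_SIZE)
]

# The reward is highest (zero) at the target and decreases with distance.
best_cell = max(
    ((row, col) for row in range(GRID_SIZE) for col in range(GRID_SIZE)),
    key=lambda cell: reward_grid[cell[0]][cell[1]],
)
```

Because the reward depends only on the position, a recovered reward can be compared with this table cell by cell, which is the kind of comparison the middle and right panels of the figure are described as showing.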

Theorems & Definitions (20)

  • Proposition 1: Lemma 1 in zeng2022maximum
  • Proposition 2
  • proof
  • Theorem 3
  • proof
  • Corollary 4
  • Lemma 5
  • proof
  • Lemma 6
  • proof
  • ...and 10 more