Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm

Yang Chen; Menglin Zou; Jiaqi Zhang; Yitan Zhang; Junyi Yang; Gael Gendron; Libo Zhang; Jiamou Liu; Michael J. Witbrock

TL;DR

The paper tackles training instability in inverse reinforcement learning (IRL) by introducing Trust Region Reward Optimization (TRRO), a non-adversarial framework based on Minorization-Maximization (MM) that guarantees monotonic improvement in the likelihood of expert behavior. From TRRO it derives Proximal Inverse Reward Optimization (PIRO), a practical algorithm with adaptive reward updates. The paper unifies non-adversarial IRL under likelihood maximization and provides a stability guarantee analogous to the one Trust Region Policy Optimization (TRPO) gives for forward RL. Empirically, PIRO achieves strong reward recovery, robust policy imitation, high sample efficiency, and effective reward transfer on MuJoCo, Gym-Robotics, and a real-world meerkat dataset, while remaining stable across tasks. The work connects theory and practice in IRL, offering a scalable, stable alternative to adversarial methods that applies to robust reward learning and to reward transfer under dynamics shifts.

Abstract

Inverse Reinforcement Learning (IRL) learns a reward function that explains expert demonstrations. Modern IRL methods often use an adversarial (minimax) formulation that alternates between reward and policy optimization, which often leads to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning the reward and the policy via energy-based formulations, but they lack formal guarantees. This work closes that gap. We first present a unified view showing that canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO as Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery and policy imitation, with high sample efficiency, on MuJoCo and Gym-Robotics benchmarks and on a real-world animal behavior modeling task.
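To make the Minorization-Maximization idea concrete, the following is a minimal generic sketch and not the TRRO or PIRO procedure itself. It assumes a toy one-parameter logistic log-likelihood (hypothetical data `xs`, `ys`) as a stand-in for the expert-behavior likelihood, and uses the standard quadratic lower bound obtained from a curvature bound `L`. All function names are illustrative. Each step maximizes a surrogate that lies below the objective and touches it at the current iterate, so the objective cannot decrease.

```python
import math

def log_likelihood(theta, xs, ys):
    # Toy objective: sum_i log sigmoid(y_i * theta * x_i), labels y_i in {-1, +1}.
    return sum(-math.log1p(math.exp(-y * theta * x)) for x, y in zip(xs, ys))

def gradient(theta, xs, ys):
    # Derivative of the toy objective with respect to theta.
    return sum(y * x / (1.0 + math.exp(y * theta * x)) for x, y in zip(xs, ys))

def mm_ascent(xs, ys, theta0=0.0, steps=20):
    # Curvature bound: the second derivative is at least -L, since
    # sigmoid(z) * (1 - sigmoid(z)) <= 1/4 for every z.
    L = sum(x * x for x in xs) / 4.0
    theta = theta0
    history = [log_likelihood(theta, xs, ys)]
    for _ in range(steps):
        # Surrogate (minorizer) at theta_k:
        #   g(t) = f(theta_k) + f'(theta_k) * (t - theta_k) - (L / 2) * (t - theta_k)**2
        # It satisfies g(t) <= f(t) and g(theta_k) = f(theta_k).
        # Its maximizer is theta_k + f'(theta_k) / L.
        theta = theta + gradient(theta, xs, ys) / L
        history.append(log_likelihood(theta, xs, ys))
    return theta, history

# Hypothetical, non-separable toy data (an assumption for illustration only).
xs = [0.5, 1.0, 1.5, 2.0, -1.0, -0.5]
ys = [1, 1, -1, 1, -1, 1]
theta, history = mm_ascent(xs, ys)
```

The quadratic penalty in the surrogate plays a role loosely similar to a proximal or trust-region term: it keeps each update close to the current iterate, which is what makes the improvement guarantee hold in this toy setting.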
