RIZE: Adaptive Regularization for Imitation Learning

Adib Karimi, Mohammad Mehdi Ebadzadeh

TL;DR

A novel Inverse Reinforcement Learning method that mitigates the rigidity of fixed reward structures and the limited flexibility of implicit reward regularization by incorporating a squared temporal-difference regularizer with adaptive targets that evolve dynamically during training, thereby imposing adaptive bounds on recovered rewards and promoting robust decision-making.

Abstract

We propose a novel Inverse Reinforcement Learning (IRL) method that mitigates the rigidity of fixed reward structures and the limited flexibility of implicit reward regularization. Building on the Maximum Entropy IRL framework, our approach incorporates a squared temporal-difference (TD) regularizer with adaptive targets that evolve dynamically during training, thereby imposing adaptive bounds on recovered rewards and promoting robust decision-making. To capture richer return information, we integrate distributional RL into the learning process. Empirically, our method achieves expert-level performance on complex MuJoCo and Adroit environments, surpassing baseline methods on the Humanoid-v2 task with limited expert demonstrations. Extensive experiments and ablation studies further validate the effectiveness of the approach and provide insights into reward dynamics in imitation learning. Our source code is available at https://github.com/adibka/RIZE.
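
The squared TD regularizer with adaptive targets can be illustrated with a minimal sketch. It assumes the regularizer is the mean squared distance between implicit rewards and a scalar target, and that the target is itself trained by gradient descent on the same squared error. The names `AdaptiveTarget` and `td_regularizer`, the learning rate, and the update rule are hypothetical choices made for illustration; they are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveTarget:
    """Scalar regularization target that evolves during training (hypothetical helper)."""
    value: float = 0.0
    lr: float = 0.1  # assumed step size, not from the paper

    def update(self, rewards):
        # d/d(value) of mean((r - value)^2) is -2 * mean(r - value).
        grad = -2.0 * sum(r - self.value for r in rewards) / len(rewards)
        self.value -= self.lr * grad
        return self.value


def td_regularizer(rewards, target):
    """Mean squared distance between implicit rewards and the current target."""
    return sum((r - target) ** 2 for r in rewards) / len(rewards)


# Usage: the target drifts toward the batch rewards, so the penalty shrinks.
target = AdaptiveTarget(value=0.0)
batch_rewards = [0.5, 1.0, 1.5]  # made-up implicit rewards for one batch
penalty_before = td_regularizer(batch_rewards, target.value)
for _ in range(100):
    target.update(batch_rewards)
penalty_after = td_regularizer(batch_rewards, target.value)
```

In this sketch a fixed target would correspond to a static reward constraint, whereas the trained target moves with the rewards it regularizes. A fuller version would presumably keep separate targets for expert and policy samples, which is again an assumption.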

Paper Structure

This paper contains 32 sections, 4 theorems, 19 equations, 17 figures, 1 table, 1 algorithm.

Key Result

Proposition 4.1

Let $R_Q(s,a) = (\mathcal{T}^{\pi} Q)(s, a)$ denote the implicit reward derived from point-estimate Q-values, where $Q(s,a) = \mathbb{E}[Z(s,a)]$. Let $\rho_E(s,a)$ and $\rho_{\pi}(s,a)$ denote occupancy measures under $\pi_E$ and $\pi$, respectively. For fixed $\pi$, the optimal TD-regularized rewa
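
The quantities named in the proposition can be sketched numerically. The example assumes that $\mathcal{T}^{\pi}$ is the inverse soft Bellman operator used in IQ-Learn-style methods, $(\mathcal{T}^{\pi} Q)(s,a) = Q(s,a) - \gamma\,\mathbb{E}_{s'}[V^{\pi}(s')]$ with $V^{\pi}(s') = \mathbb{E}_{a' \sim \pi}[Q(s',a') - \alpha \log \pi(a' \mid s')]$, and that $\mathbb{E}[Z(s,a)]$ is approximated by averaging sampled quantile values. The function names and the parameters `gamma` and `alpha` are illustrative assumptions.

```python
def q_from_quantiles(quantile_values):
    """Point estimate Q(s,a) = E[Z(s,a)], approximated by the mean of sampled quantiles."""
    return sum(quantile_values) / len(quantile_values)


def soft_value(q_next, log_probs, alpha=1.0):
    """Monte Carlo estimate of V(s') = E_{a'~pi}[Q(s',a') - alpha * log pi(a'|s')]."""
    terms = [q - alpha * lp for q, lp in zip(q_next, log_probs)]
    return sum(terms) / len(terms)


def implicit_reward(q_sa, v_next, gamma=0.99):
    """R_Q(s,a) = Q(s,a) - gamma * V(s') (assumed form of the inverse Bellman operator)."""
    return q_sa - gamma * v_next


# Usage with made-up numbers: three quantile samples for (s, a),
# and two actions sampled from the policy at s'.
q_sa = q_from_quantiles([1.0, 2.0, 3.0])
v_next = soft_value(q_next=[1.0, 3.0], log_probs=[-1.0, -1.0], alpha=0.5)
r = implicit_reward(q_sa, v_next, gamma=0.5)
```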

Figures (17)

  • Figure 1: RLiable (Agarwal et al., 2021) plots for RIZE vs. BC, LSIQ, SQIL, CSIL, and IQ-Learn on six MuJoCo/Adroit tasks. For each setting (3 demos; 10 demos), we report aggregate Median, IQM, Mean, and Optimality Gap with 95% confidence intervals computed via percentile bootstrap stratified over tasks and five seeds. Scores are normalized to expert performance. Higher is better for Median, IQM, and Mean; lower is better for Optimality Gap.
  • Figure 2: Normalized returns on MuJoCo and Adroit tasks for RIZE and baselines. We first compute, per seed, the average episodic return over the final third of training steps; bars show the mean across five seeds and error bars denote the 95% confidence interval. Returns are normalized to expert performance and reported for both 3 and 10 expert demonstrations.
  • Figure 3: Learning curves on MuJoCo and Adroit tasks with 10 expert demonstrations. Lines show the mean normalized return across five seeds; shaded regions denote 95% confidence intervals.
  • Figure 4: Implicit reward curves for expert and policy samples on MuJoCo and Adroit tasks with 10 expert demonstrations. Each subplot reports the mean across five seeds, with shaded regions showing the 95% confidence interval. Theoretical upper and lower bounds derived in this work are overlaid as separate curves in each subplot.
  • Figure 5: Ablation on critic architecture: $Z(s,a)$ via Implicit Quantile Networks (IQN; Dabney et al., 2018) versus classic $Q(s,a)$. We report expert-normalized returns across all MuJoCo and Adroit tasks using three expert demonstrations; metrics show the mean over five seeds with 95% confidence intervals.
  • ...and 12 more figures
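
The aggregate metrics named in the captions of Figures 1 and 2 can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the RLiable library or the authors' evaluation code: IQM is taken as the mean of the middle 50% of pooled scores, the optimality gap as the mean shortfall below an expert-normalized score of 1.0, and the stratified bootstrap as resampling seeds with replacement within each task. All function names, the number of resamples, and the task scores in the usage lines are hypothetical.

```python
import random


def final_third_mean(returns):
    """Average episodic return over the final third of a training curve."""
    tail = returns[-max(1, len(returns) // 3):]
    return sum(tail) / len(tail)


def iqm(scores):
    """Interquartile mean: drop the lowest and highest 25%, average the rest."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def optimality_gap(scores, threshold=1.0):
    """Mean amount by which scores fall short of the threshold (lower is better)."""
    return sum(max(threshold - s, 0.0) for s in scores) / len(scores)


def stratified_bootstrap_ci(scores_by_task, metric, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI, resampling seeds independently within each task."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        pooled = []
        for runs in scores_by_task.values():
            pooled.extend(rng.choices(runs, k=len(runs)))
        stats.append(metric(pooled))
    stats.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = min(n_boot - 1, int((1.0 + level) / 2.0 * n_boot))
    return stats[lo_idx], stats[hi_idx]


# Usage with made-up normalized scores (two tasks, five seeds each).
scores = {
    "task_a": [0.90, 1.00, 0.95, 1.05, 0.85],
    "task_b": [0.60, 0.70, 0.75, 0.65, 0.80],
}
pooled = [s for runs in scores.values() for s in runs]
point_iqm = iqm(pooled)
ci_low, ci_high = stratified_bootstrap_ci(scores, iqm)
```

Resampling within each task keeps every task represented in every bootstrap replicate, which is the reason for stratifying rather than resampling the pooled scores directly.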

Theorems & Definitions (8)

  • Proposition 4.1
  • proof
  • Corollary 4.2
  • proof
  • Lemma A.1
  • proof
  • Corollary A.2
  • proof