Transferable Reward Learning by Dynamics-Agnostic Discriminator Ensemble

Fan-Ming Luo, Xingchen Cao, Rong-Jun Qin, Yang Yu

TL;DR

This work tackles the challenge of learning rewards that generalize across environments with different dynamics. It proposes DARL, a dynamics-agnostic discriminator ensemble that decouples reward signals from dynamics via mutual information minimization and leverages an ensemble of past discriminators to eliminate policy dependency. Empirical results on MuJoCo dynamics-transfer tasks show DARL yields rewards closely aligned with true environment rewards and enables superior policy performance across a wide range of transfer scenarios, including action-dependent rewards. The approach offers a practical, transferable IRL framework with theoretical guarantees and robust empirical performance, highlighting its potential to improve apprenticeship learning in variable dynamics settings.

Abstract

Recovering a reward function from expert demonstrations is a fundamental problem in reinforcement learning. The recovered reward function captures the motivation of the expert. Agents can imitate experts by following these reward functions in their environment, which is known as apprenticeship learning. However, agents may face environments different from those of the demonstrations and therefore need transferable reward functions. Classical reward learning methods such as inverse reinforcement learning (IRL) or, equivalently, adversarial imitation learning (AIL) recover reward functions coupled with the training dynamics, which makes them hard to transfer. Previous dynamics-agnostic reward learning methods rely on assumptions, such as requiring the reward function to be state-only, that restrict their applicability. In this work, we present a dynamics-agnostic discriminator-ensemble reward learning method (DARL) within the AIL framework, capable of learning both state-action and state-only reward functions. DARL achieves this by decoupling the reward function from the training dynamics, employing a dynamics-agnostic discriminator on a latent space derived from the original state-action space. This latent space is optimized to minimize information about the dynamics. We further identify a policy-dependency issue in the AIL framework that reduces transferability. DARL represents the reward function as an ensemble of discriminators during training to eliminate policy dependencies. Empirical studies on MuJoCo tasks with changed dynamics show that DARL better recovers the reward function and yields better imitation performance in transferred environments, handling both state-only and state-action reward scenarios.
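To make the ensemble idea in the abstract concrete, here is a minimal sketch of deriving a reward from several discriminator snapshots. It rests on assumptions that this summary does not confirm: each discriminator returns the probability that a state-action pair came from the expert, the per-discriminator reward uses the GAIL-style form $-\log(1 - D(s,a))$, and the ensemble is a plain average. The names `gail_style_reward`, `ensemble_reward`, and `make_toy_discriminator` are hypothetical, and the toy logistic discriminators stand in for learned networks.

```python
import math
from typing import Callable, Sequence

# A discriminator maps (state, action) to the probability that the pair is expert data.
Discriminator = Callable[[Sequence[float], Sequence[float]], float]


def gail_style_reward(expert_prob: float, eps: float = 1e-8) -> float:
    """Assumed reward form -log(1 - D); clipping keeps the logarithm finite."""
    clipped = min(max(expert_prob, eps), 1.0 - eps)
    return -math.log(1.0 - clipped)


def ensemble_reward(
    discriminators: Sequence[Discriminator],
    state: Sequence[float],
    action: Sequence[float],
) -> float:
    """Average the rewards derived from every stored discriminator snapshot.

    The plain mean is an assumption made for illustration only.
    """
    if not discriminators:
        raise ValueError("need at least one discriminator snapshot")
    rewards = [gail_style_reward(d(state, action)) for d in discriminators]
    return sum(rewards) / len(rewards)


def make_toy_discriminator(weight: float, bias: float) -> Discriminator:
    """Hypothetical stand-in for a learned discriminator: a logistic score."""
    def discriminator(state: Sequence[float], action: Sequence[float]) -> float:
        logit = weight * (sum(state) + sum(action)) + bias
        return 1.0 / (1.0 + math.exp(-logit))
    return discriminator


# Three snapshots, as if saved at different points of adversarial training.
snapshots = [
    make_toy_discriminator(0.5, 0.0),
    make_toy_discriminator(1.0, -0.2),
    make_toy_discriminator(2.0, 0.1),
]
state, action = [0.1, 0.2], [0.3]
per_snapshot = [gail_style_reward(d(state, action)) for d in snapshots]
averaged = ensemble_reward(snapshots, state, action)
print("per-snapshot rewards:", per_snapshot)
print("ensemble reward:", averaged)
```

Because the average always lies between the smallest and largest per-snapshot reward, no single discriminator, which was trained against one particular policy, decides the signal on its own. That is the intuition behind using an ensemble to reduce policy dependency.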

Paper Structure (34 sections, 4 theorems, 52 equations, 17 figures, 15 tables, 2 algorithms)


Key Result

Theorem 1

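Let $q(s'| z;\theta)$ be a variational approximation of $p(s'| z)$ with parameter $\theta$. Denote $q(z,s';\theta)=q(s'| z;\theta)p(z)$. If

$$D_\text{KL}\big(p(z,s') \,\|\, q(z,s';\theta)\big) \leq D_\text{KL}\big(p(z)p(s') \,\|\, q(z,s';\theta)\big),$$

then $I(z;s') \leq I_\text{vCLUB}(z;s')$, where

$$I_\text{vCLUB}(z;s') = \mathbb{E}_{p(z,s')}\big[\log q(s'| z;\theta)\big] - \mathbb{E}_{p(z)}\mathbb{E}_{p(s')}\big[\log q(s'| z;\theta)\big].$$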
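As a rough illustration of how the vCLUB quantity named in Theorem 1 can be estimated from samples, the sketch below uses the sample form from the cited CLUB work: the mean of $\log q(s'_i | z_i;\theta)$ over matched pairs minus the mean of $\log q(s'_j | z_i;\theta)$ over all pairs. Everything else is an assumption for illustration: scalar $z$ and $s'$, a fixed-variance Gaussian for $q$, and the hypothetical names `gaussian_log_prob`, `vclub_estimate`, and `predict_mean`. It is not the paper's implementation.

```python
import math
import random
from typing import Callable, Sequence


def gaussian_log_prob(y: float, mean: float, std: float) -> float:
    """Log-density of a scalar Gaussian N(mean, std^2) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (y - mean) ** 2 / (2.0 * std ** 2)


def vclub_estimate(
    zs: Sequence[float],
    next_states: Sequence[float],
    predict_mean: Callable[[float], float],
    std: float,
) -> float:
    """Sample estimate of I_vCLUB(z; s') with q(s'|z) = N(predict_mean(z), std^2).

    Matched pairs (z_i, s'_i) approximate the joint expectation; all pairs
    (z_i, s'_j) approximate the expectation under the product of marginals.
    """
    n = len(zs)
    total = 0.0
    for i in range(n):
        mean_i = predict_mean(zs[i])
        matched = gaussian_log_prob(next_states[i], mean_i, std)
        all_pairs = sum(gaussian_log_prob(s, mean_i, std) for s in next_states) / n
        total += matched - all_pairs
    return total / n


rng = random.Random(0)
n = 300
zs = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Case 1: the next state depends on z (s' = z + noise), so z carries dynamics information.
dependent_next = [z + 0.5 * rng.gauss(0.0, 1.0) for z in zs]
# Case 2: the next state is drawn independently of z.
independent_next = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Hypothetical transition model: predicts s' = z with standard deviation 0.5.
dependent_estimate = vclub_estimate(zs, dependent_next, lambda z: z, std=0.5)
independent_estimate = vclub_estimate(zs, independent_next, lambda z: z, std=0.5)
print("vCLUB, dependent pairs:  ", dependent_estimate)
print("vCLUB, independent pairs:", independent_estimate)
```

In this toy setting the estimate is clearly larger when $s'$ depends on $z$ than when the two are independent. Minimizing such a bound with respect to an encoder that produces $z$ would therefore push the embedding toward carrying less information about the next state, which matches the role the bound plays in the framework described here.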

Figures (17)

  • Figure 1: The framework of dynamics-agnostic discriminator learning by mutual information minimization. The input of the discriminator comprises embeddings of the state-action pairs, generated by a state encoder and an action encoder. The optimization target is to ensure that these embeddings have minimal mutual information with the next state. The optimization objective maximizes the prediction errors of a transition model while minimizing the prediction errors associated with incorrect labels.
  • Figure 2: Classification precision and derived reward of the discriminators learned successively in a GAIL training process. The precision and the derived reward are evaluated on a set of data with various returns.
  • Figure 3: Derived return and classification precision of the discriminators or discriminator ensembles learned by various methods.
  • Figure 4: Normalized derived return and true environment return vs. policy training iteration in HalfCheetah. We train a policy using the ensemble discriminator (a) or the last discriminator (b) as the reward function. The left sub-figures in both panels depict the return derived by each discriminator for the policy at every iteration. Specifically, the point $(x,y)$, marked with color $c$, represents the normalized return derived by the $y$-th discriminator for the policy learned in the $x$-th iteration. The right sub-figures in both panels display the normalized returns derived by different reward models and the true environment returns vs. the policy training iterations.
  • Figure 5: True environment returns and derived returns vs. gravity. We trained a policy in HalfCheetah with gravity perturbations and sampled with this policy under different gravity settings. The figure is plotted using the trajectories whose performance reaches $75\%$ of the expert return.
  • ...and 12 more figures

Theorems & Definitions (7)

  • Theorem 1: Variational Contrastive Log-Ratio Upper Bound cheng2020vclub
  • Theorem 2: Discriminator Ensemble Upper Bound
  • proof
  • Theorem 3: Discriminator Ensemble Upper Bound
  • proof
  • Theorem 4: Variational Contrastive Log-Ratio Upper Bound cheng2020vclub
  • proof