Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation

Yihong Guo, Yixuan Wang, Yuanyuan Shi, Pan Xu, Anqi Liu

TL;DR

This work proposes to utilize imitation learning to transfer the policy learned from the reward modification to the target domain so that the new policy can generate the same trajectories in the target domain.

Abstract

Training a policy in a source domain for deployment in the target domain under a dynamics shift can be challenging, often resulting in performance degradation. Previous work tackles this challenge by training on the source domain with modified rewards derived by matching distributions between the source and the target optimal trajectories. However, pure modified rewards only ensure the behavior of the learned policy in the source domain resembles trajectories produced by the target optimal policies, which does not guarantee optimal performance when the learned policy is actually deployed to the target domain. In this work, we propose to utilize imitation learning to transfer the policy learned from the reward modification to the target domain so that the new policy can generate the same trajectories in the target domain. Our approach, Domain Adaptation and Reward Augmented Imitation Learning (DARAIL), utilizes the reward modification for domain adaptation and follows the general framework of generative adversarial imitation learning from observation (GAIfO) by applying a reward augmented estimator for the policy optimization step. Theoretically, we present an error bound for our method under a mild assumption regarding the dynamics shift to justify the motivation of our method. Empirically, our method outperforms the pure modified reward method without imitation learning and also outperforms other baselines in benchmark off-dynamics environments.
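The abstract does not spell out the reward modification. As a minimal sketch, assuming it is the classifier-based correction from the DARC method that the figure captions refer to, the correction added to the source-domain reward can be computed from two domain classifiers. The function and argument names below are hypothetical, and the classifier probabilities are plain numbers here rather than learned models.

```python
import math


def darc_reward_correction(p_trg_given_sas, p_trg_given_sa):
    """Reward correction from two domain-classifier outputs (illustrative sketch).

    p_trg_given_sas: classifier probability that (s, a, s') came from the target domain.
    p_trg_given_sa:  classifier probability that (s, a) came from the target domain.
    Both are assumed to lie strictly between 0 and 1.

    Computes
        log p(trg | s, a, s') - log p(trg | s, a)
      - log p(src | s, a, s') + log p(src | s, a),
    where p(src | .) = 1 - p(trg | .).
    """
    return (
        math.log(p_trg_given_sas)
        - math.log(p_trg_given_sa)
        - math.log(1.0 - p_trg_given_sas)
        + math.log(1.0 - p_trg_given_sa)
    )


def modified_reward(reward, p_trg_given_sas, p_trg_given_sa):
    """Source-domain training reward: task reward plus the correction term."""
    return reward + darc_reward_correction(p_trg_given_sas, p_trg_given_sa)
```

With ideal classifiers the correction equals the log-ratio of target to source transition probabilities, so transitions that are likely in the source but unlikely in the target are penalized. This shapes behavior in the source domain only, which is the limitation the abstract points to.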

Paper Structure

This paper contains 32 sections, 4 theorems, 46 equations, 15 figures, 12 tables, 1 algorithm.

Key Result

Theorem 4.1

Let $m$ be the number of expert demonstrations and $\hat{\mathcal{R}}^{(m)}_{\pi} = \mathbb{E}_{\sigma}\left[\sup_{D\in \mathcal{D}} \frac{1}{m}\sum_{i = 1}^{m} \sigma_i D(s_t,s_{t+1}) \right]$ be the empirical Rademacher complexity. Let $B$ be the error bound of DARC in the source domain, i.e. ...
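To make the empirical Rademacher complexity in the theorem statement concrete, the sketch below estimates it by Monte Carlo sampling of the random signs. It assumes a finite discriminator class, so the supremum becomes a maximum; the paper's discriminator class is not specified in this excerpt, and the function name and the precomputed output table are hypothetical.

```python
import random


def empirical_rademacher(outputs, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_D (1/m) * sum_i sigma_i * D(x_i) ].

    outputs[k][i] is the value of the k-th discriminator on the i-th
    demonstration transition (s_t, s_{t+1}). A finite class is assumed,
    so the supremum over the class is taken as a maximum over rows.
    """
    rng = random.Random(seed)
    m = len(outputs[0])
    total = 0.0
    for _ in range(n_draws):
        # Draw independent Rademacher signs, each +1 or -1 with probability 1/2.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        # Best signed correlation achievable by any discriminator in the class.
        total += max(
            sum(s * d for s, d in zip(sigma, row)) / m for row in outputs
        )
    return total / n_draws
```

A class containing a single fixed discriminator yields an estimate near zero, while a richer class that can align with the random signs yields a larger value. Averaging over more transitions makes it harder to align with the signs, so the estimate also shrinks as the number of demonstrations grows.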

Figures (15)

  • Figure 1: (a) Training reward in the source domain, i.e. $\mathbb{E}_{\pi_{\text{DARC},p_{\text{src}}}} [\sum_t r(s_t,a_t)]$, evaluation reward in the target domain, i.e. $\mathbb{E}_{\pi_{\text{DARC},p_{\text{trg}}}} [\sum_t r(s_t,a_t)]$, and optimal reward in the target domain, for DARC in Ant. Evaluating the trained DARC policy in the target domain causes performance degradation relative to its training reward, which should be close to the optimal reward in the target domain given DARC's objective function. Results for HalfCheetah, Walker2d, and Reacher are in Figure \ref{fig:gap_between_darc_on_target_source} in the Appendix. (b) Learning framework of DARAIL. DARC training: we first train DARC in the source domain with a modified reward derived by minimizing the reverse divergence between the optimal policy on the target and the learned policy on the source. Details of DARC and the modified reward are in Section \ref{section: introduction of darc} and Appendix \ref{appendix: darc objective}. Discriminator training: the discriminator is trained to classify whether data comes from the expert demonstrations (DARC trajectories) and provides a local reward function for policy learning. Generator training: the policy is updated with an augmented reward estimate that integrates the reward from the source domain with information from the discriminator. We first train DARC, collect DARC trajectories from the source domain, and then train the discriminator and the generator alternately.
  • Figure 2: Performance of DARAIL and IPS-ACL on HalfCheetah and Walker2d under different importance weight clipping intervals. DARAIL outperforms IPS-ACL on all tasks. In Table \ref{table: exp result broken src}, IPS-ACL achieves performance comparable to DARAIL with the clipping interval [0.01, 100], while its performance decreases significantly with other intervals.
  • Figure 3: Performance of DARC and DARAIL under different off-dynamics shifts on Ant. Action $0$ is frozen (set to 0) with probability $p_f$ in the source domain. From left to right, the off-dynamics shift increases. As the shift grows, the gap between DARC Training and DARC Evaluation widens. Our method outperforms DARC across the different dynamics shifts.
  • Figure 4: Optimization objective of DARC. DARC minimizes the reverse KL divergence between the trajectories generated by $\pi_{\text{DARC}}$ and those generated by the optimal policy $\pi^*$.
  • Figure 5: Training reward in the source domain, i.e. $\mathbb{E}_{\pi_{\text{DARC},p_{\text{src}}}} [\sum_t r(s_t,a_t)]$, and evaluation reward in the target domain, i.e. $\mathbb{E}_{\pi_{\text{DARC},p_{\text{trg}}}} [\sum_t r(s_t,a_t)]$, for DARC in Ant and Walker2d with different broken probabilities $p_f$ in the source domain. (a) and (c) show the performance of DARC under $p_f = 0.8$, and (b) and (d) show the performance of DARC under $p_f = 1.0$. The performance of DARC under $p_f = 1.0$ is much worse than under $p_f = 0.8$, and the performance gap between DARC in the source and target domains is larger, showing that the dynamics shift is too large for DARC to adapt to and to learn a good expert demonstration.
  • ...and 10 more figures

Theorems & Definitions (8)

  • Theorem 4.1
  • Remark 4.2
  • Lemma A.1
  • Definition B.1
  • Definition B.2
  • Definition B.3
  • Lemma B.4
  • Theorem B.5