Cross-Domain Policy Adaptation by Capturing Representation Mismatch

Jiafei Lyu, Chenjia Bai, Jingwen Yang, Zongqing Lu, Xiu Li

TL;DR

This work tackles cross-domain reinforcement learning when the source and target domains share the same state space, action space, and reward function but differ in dynamics. It introduces PAR, a decoupled representation-learning approach that trains latent encoders in the target domain and uses representation deviations on source-domain transitions as a reward penalty to align dynamics. The authors establish theoretical connections between representation mismatch and dynamics mismatch, deriving both online and offline performance bounds, and implement PAR on SAC with optional offline behavior cloning. Empirically, PAR achieves strong performance across kinematic and morphology shifts, outperforming state-of-the-art baselines in online and offline settings and offering improved sample efficiency and runtime characteristics. The work highlights a practical, theory-backed pathway for dynamics-aware policy transfer without requiring abundant target-domain data or demonstrations, with limitations that motivate adaptive penalty tuning and broader-domain extensions.

Abstract

In reinforcement learning (RL), it is vital to learn effective policies that can be transferred to different domains with dynamics discrepancies. In this paper, we consider dynamics adaptation settings where there is a dynamics mismatch between the source domain and the target domain, and where one has access to sufficient source-domain data but only limited interactions with the target domain. Existing methods address this problem by, for example, learning domain classifiers or filtering data from a value-discrepancy perspective. Instead, we tackle this challenge from a decoupled representation learning perspective. We perform representation learning only in the target domain and measure the representation deviations on transitions from the source domain, which we show can be a signal of dynamics mismatch. We also show that the representation deviation upper-bounds the performance difference of a given policy between the source domain and the target domain, which motivates us to adopt the representation deviation as a reward penalty. The learned representations are not used by either the policy or the value function and serve only as a reward penalizer. We conduct extensive experiments on environments with kinematic and morphology mismatch, and the results show that our method exhibits strong performance on many tasks. Our code is publicly available at https://github.com/dmksjfl/PAR.
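
The reward-penalty idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the representation deviation is the mean squared distance between a predicted latent $g(f(s), a)$ and the encoded next state $f(s^\prime)$, and it uses toy hand-written encoders in place of networks trained on target-domain data. The function names, the toy dynamics, and the value of $\beta$ are all hypothetical.

```python
import numpy as np

def representation_deviation(f, g, s, a, s_next):
    """Per-transition deviation between the predicted latent g(f(s), a)
    and the encoded next state f(s_next). The squared-error form is an
    assumption made for this sketch."""
    z_pred = g(f(s), a)
    z_next = f(s_next)
    return np.mean((z_pred - z_next) ** 2, axis=-1)

def penalize_source_rewards(r_src, deviation, beta):
    """Subtract the scaled deviation from the source-domain rewards."""
    return r_src - beta * deviation

# Toy stand-ins for encoders fitted to the target domain (hypothetical):
# the target dynamics are s' = s + a, so f is the identity and g adds a.
f = lambda s: s
g = lambda z, a: z + a

s = np.array([[0.0, 1.0], [2.0, -1.0]])
a = np.array([[1.0, 0.0], [0.0, 2.0]])

s_next_tar = s + a        # transitions consistent with the target dynamics
s_next_src = s + 0.5 * a  # source dynamics differ (actions have half the effect)

dev_tar = representation_deviation(f, g, s, a, s_next_tar)
dev_src = representation_deviation(f, g, s, a, s_next_src)

r_src = np.array([1.0, 1.0])
r_src_penalized = penalize_source_rewards(r_src, dev_src, beta=2.0)
```

In this toy setup, transitions that follow the target dynamics incur zero deviation, while source transitions are penalized in proportion to how far their outcome departs from what the target-trained encoders predict. The encoders only modify rewards here and are never fed to a policy or value function, which mirrors the decoupling described above.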

Paper Structure

This paper contains 31 sections, 11 theorems, 38 equations, 11 figures, 4 tables, 3 algorithms.

Key Result

Theorem 4.2

For any $(s,a)$, denote its representation as $z$, and suppose $s_{\rm src}^\prime\sim P_{\mathcal{M}_{\rm src}}(\,\cdot\,|s,a)$, $s_{\rm tar}^\prime\sim P_{\mathcal{M}_{\rm tar}}(\,\cdot\,|s,a)$. Denote $h(z;s_{\rm src}^\prime,s_{\rm tar}^\prime)=I(z;s_{\rm tar}^\prime) - I(z;s_{\rm src}^\prime)$,

Figures (11)

  • Figure 1: Illustration of PAR. We train encoders $f,g$ merely with target domain data and utilize them to modify rewards from the source domain with measured representation deviations. Afterward, the downstream SAC algorithm can learn from transitions from both domains.
  • Figure 2: Adaptation performance comparison when the source domain is online. The curves depict the test performance of each algorithm in the target domain under kinematic shifts (top) and morphology shifts (bottom). The modification to the environment is specified in the parentheses of the task name. The solid lines are the average returns over 5 different random seeds and the shaded region captures the standard deviation. The dashed line of SAC-tune denotes its final performance after fine-tuning $10^5$ steps.
  • Figure 3: Parameter study of (a) reward penalty coefficient $\beta$, (b) target domain interaction interval $F$. Results are averaged over 5 seeds and the shaded region denotes the standard deviation.
  • Figure 4: Runtime comparison of different methods.
  • Figure 5: Performance comparison between PAR and PAR-B.
  • ...and 6 more figures

Theorems & Definitions (18)

  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4: Online performance bound
  • Theorem 4.5: Offline performance bound
  • Theorem 1.1
  • proof
  • Theorem 1.2
  • proof
  • Theorem 1.3: Online performance bound
  • proof
  • ...and 8 more