Cross-Domain Offline Policy Adaptation with Dynamics- and Value-Aligned Data Filtering

Zhongjian Qiao, Rui Yang, Jiafei Lyu, Chenjia Bai, Xiu Li, Zhuoran Yang, Siyang Gao, Shuang Qiu

TL;DR

The paper tackles cross-domain offline RL under dynamics shifts by showing that both dynamics and value alignment are needed for effective transfer. It introduces DVDF, a plug-in data-filtering method that jointly evaluates dynamics alignment (via a contrastive score) and value alignment (via a pre-trained source-policy advantage) to selectively reuse source-domain data. DVDF integrates with strong baselines (IGDF/OTDF), uses a pre-trained SQL policy to estimate advantages, and weights source TD losses by a combined score, leading to consistent gains across multiple tasks and extremely low target-data scenarios. The theoretical bound and extensive experiments demonstrate DVDF’s ability to significantly improve target-domain performance by mitigating both dynamics mismatch and value misalignment.
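The joint filtering idea can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names, the min-max normalization, the convex combination controlled by `lam`, and the top-fraction cutoff `keep_ratio` are all assumptions made for the example, and they need not match how the paper defines $\lambda$ and $\xi$. The dynamics scores and advantages are taken as given arrays; in the paper they come from a contrastive score and a pre-trained source policy.

```python
import numpy as np


def minmax(x):
    """Rescale an array to [0, 1]; a constant array maps to all zeros."""
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span


def joint_filter_weights(dyn_score, advantage, lam=0.5, keep_ratio=0.5):
    """Hypothetical joint filter over source-domain transitions.

    dyn_score  : higher means the transition looks more like target dynamics
    advantage  : higher means the action was better under the source policy
    lam        : assumed trade-off between the two normalized scores
    keep_ratio : assumed fraction of source samples to keep
    """
    dyn_score = np.asarray(dyn_score, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    combined = lam * minmax(dyn_score) + (1.0 - lam) * minmax(advantage)
    threshold = np.quantile(combined, 1.0 - keep_ratio)
    keep = combined >= threshold
    # Kept samples are weighted by their combined score; the rest get zero.
    return combined * keep


def weighted_td_loss(td_errors, weights):
    """Mean of weighted squared TD errors over the kept samples."""
    td_errors = np.asarray(td_errors, dtype=float)
    n_kept = max(int(np.sum(weights > 0)), 1)
    return float(np.sum(weights * td_errors ** 2) / n_kept)


# Toy batch: sample 2 matches the target dynamics well but has a negative
# advantage, so a dynamics-only filter would keep it while this one drops it.
dyn = [0.9, 0.1, 0.8, 0.2]
adv = [1.0, 2.0, -1.0, 0.5]
w = joint_filter_weights(dyn, adv, lam=0.5, keep_ratio=0.5)
loss = weighted_td_loss([1.0, 1.0, 1.0, 1.0], w)
```

In this toy batch the first two transitions are kept and the last two receive zero weight, which mirrors the TL;DR's point that dynamics alignment alone can admit low-value source data.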

Abstract

Cross-Domain Offline Reinforcement Learning aims to train an agent deployed in the target environment, leveraging both a limited target domain dataset and a source domain dataset with (possibly) sufficient data coverage. Due to the underlying dynamics misalignment between the source and target domain, simply merging the data from two datasets may incur inferior performance. Recent advances address this issue by selectively sharing source domain samples that exhibit dynamics alignment with the target domain. However, these approaches focus solely on dynamics alignment and overlook value alignment, i.e., selecting high-quality, high-value samples from the source domain. In this paper, we first demonstrate that both dynamics alignment and value alignment are essential for policy learning, by examining the limitations of the current theoretical framework for cross-domain RL and establishing a concrete sub-optimality gap of a policy trained on the source domain and evaluated on the target domain. Motivated by the theoretical insights, we propose to selectively share those source domain samples with both high dynamics and value alignment and present our Dynamics- and Value-aligned Data Filtering (DVDF) method. We design a range of dynamics shift settings, including kinematic and morphology shifts, and evaluate DVDF on various tasks and datasets, as well as in challenging extremely low-data settings where the target domain dataset contains only 5,000 transitions. Extensive experiments demonstrate that DVDF consistently outperforms prior strong baselines and delivers exceptional performance across multiple tasks and datasets.

Paper Structure

This paper contains 36 sections, 5 theorems, 51 equations, 5 figures, 8 tables, 2 algorithms.

Key Result

Lemma 4.1

Denote the MDP of the source domain and target domain as $\mathcal{M}_{\mathrm{src}}$ and $\mathcal{M}_{\mathrm{tar}}$. We have the performance difference of a policy $\pi$ under $\mathcal{M}_{\mathrm{src}}$ and $\mathcal{M}_{\mathrm{tar}}$ as below, where $C_1=\frac{2\gamma r_{\max}}{(1-\gamma)^2}$ is a positive constant.
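To get a feel for the size of the constant, consider two illustrative settings (the values of $\gamma$ and $r_{\max}$ here are assumptions chosen for the example, not taken from the paper). With $\gamma = 0.99$ and $r_{\max} = 1$:

$$C_1 = \frac{2\gamma r_{\max}}{(1-\gamma)^2} = \frac{2 \cdot 0.99 \cdot 1}{(0.01)^2} = \frac{1.98}{10^{-4}} = 19800,$$

whereas with $\gamma = 0.9$ and $r_{\max} = 1$:

$$C_1 = \frac{2 \cdot 0.9 \cdot 1}{(0.1)^2} = \frac{1.8}{10^{-2}} = 180.$$

Because $C_1$ grows as $(1-\gamma)^{-2}$, it becomes very large at long effective horizons.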

Figures (5)

  • Figure 1: (a): Robot morphology visualization of target domain (left) and source domain (right). (b): Source data filtering visualization of IGDF. (c): Source data filtering visualization of DVDF. (d): Performance comparison between IGDF and DVDF on the target domain.
  • Figure 2: Ablation study on SQL pre-trained advantage function.
  • Figure 3: Parameter sensitivity experiments on $\lambda$ and $\xi$.
  • Figure 4: Visualization of the target domains and source domains with kinematic shifts and morphology shifts, across four tasks (ant, halfcheetah, hopper, walker2d).
  • Figure 5: (a): Visualization of source domain data. (b): Source domain data filtering visualization of Value-IGDF. (c): Performance comparison between IGDF, Value-IGDF and DVDF.

Theorems & Definitions (12)

  • Lemma 4.1: Performance difference bounded by the dynamics misalignment
  • Proposition 4.1: Sub-optimality gap on target domain
  • Proposition 5.1: Value Misalignment
  • Remark 1
  • Remark 2
  • Proposition 5.2: Advantage Approximation Error
  • proof
  • Corollary B.1: Tighter Performance Bound.
  • proof
  • proof
  • ...and 2 more