Cross-Domain Offline Policy Adaptation via Selective Transition Correction

Mengbei Yan; Jiafei Lyu; Shengjie Sun; Zhongjian Qiao; Jingwen Yang; Zichuan Lin; Deheng Ye; Xiu Li

TL;DR

This paper studies cross-domain offline RL, in which an offline dataset from a similar source domain supplements a target domain dataset for policy learning. It proposes the Selective Transition Correction (STC) algorithm, which enables reliable use of source domain data for policy adaptation.

Abstract

Adapting policies across domains with mismatched dynamics remains a critical challenge in reinforcement learning (RL). In this paper, we study cross-domain offline RL, in which an offline dataset from a similar source domain is available to supplement policy learning on a target domain dataset. Directly merging the two datasets may lead to suboptimal performance because of the dynamics mismatch. Existing approaches typically mitigate this issue by filtering source domain transitions or modifying their rewards, which may leave valuable source domain data underexploited. Instead, we propose to transform the source domain data so that it matches the target domain. To that end, we leverage an inverse policy model and a reward model to correct the actions and rewards of source transitions, explicitly aligning them with the target dynamics. Since limited data may result in inaccurate models, we further employ a forward dynamics model to retain only those corrected samples that match the target dynamics better than the original transitions do. The resulting Selective Transition Correction (STC) algorithm enables reliable use of source domain data for policy adaptation. Experiments on various environments with dynamics shifts demonstrate that STC outperforms existing baselines.
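The correct-then-select idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function name `selective_correction`, the toy linear dynamics, the hand-written stand-ins for the learned models `f_inv`, `f_fwd`, and `r_model`, and the selection rule (keep a corrected sample only if its forward-model prediction error is strictly smaller than that of the original sample).

```python
import numpy as np

def selective_correction(s, a, r, s_next, f_inv, f_fwd, r_model):
    """Correct source actions/rewards, keeping a correction only when it
    reduces the forward-model error (hypothetical selection rule)."""
    a_corr = f_inv(s, s_next)        # action the target domain would need for s -> s_next
    r_corr = r_model(s, a_corr)      # reward re-labelled for the corrected action
    err_orig = np.linalg.norm(f_fwd(s, a) - s_next, axis=1)
    err_corr = np.linalg.norm(f_fwd(s, a_corr) - s_next, axis=1)
    keep = err_corr < err_orig       # True where the correction matches target dynamics better
    a_out = np.where(keep[:, None], a_corr, a)
    r_out = np.where(keep, r_corr, r)
    return a_out, r_out, keep

# Toy setup (assumed): target dynamics s' = s + a, source dynamics s' = s + 0.5 * a.
rng = np.random.default_rng(0)
s = rng.normal(size=(5, 2))
a = rng.normal(size=(5, 2))
s_next = s + 0.5 * a                 # transitions collected in the source domain
r = -np.sum(a ** 2, axis=1)          # toy reward: negative action magnitude

f_fwd = lambda s, a: s + a                        # stand-in for the learned forward model
f_inv = lambda s, s_next: s_next - s              # stand-in for the learned inverse policy model
r_model = lambda s, a: -np.sum(a ** 2, axis=1)    # stand-in for the learned reward model

a_out, r_out, keep = selective_correction(s, a, r, s_next, f_inv, f_fwd, r_model)
```

In this toy case the stand-in models are exact, so every correction is kept and each corrected action equals half of the original one. With learned, imperfect models, the `keep` mask is what would prevent a poor correction from replacing the original transition.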

Paper Structure (53 sections, 7 theorems, 48 equations, 6 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Selective Source Transition Correction
Source Transition Correction
Theoretical Analysis
Selective Correction Mechanism
Actor-Critic Learning
Experiments
Main Results
Tasks and datasets.
Influence of Target Domain Dataset Quality
Influence of Target Domain Dataset Size
STC Can Correct Source Samples Reliably
Parameter Study
...and 38 more sections

Key Result

Theorem 4.4

Denote the corrected source domain transition dynamics as $\widetilde{P}_{\rm src}$. Then, under the paper's assumptions on the dynamics and the policy, the deviation between the corrected dynamics and the empirical target domain dynamics $\widehat{P}_{\rm tar}(\cdot|s,a)$ is bounded.

Figures (6)

Figure 1: Training pipeline of our proposed STC algorithm. In Phase I, we train the forward dynamics model $f_{\rm fwd}(s,a)$, the reward model $r(s,a)$, and the inverse policy model $f_{\rm inv}(s,s^\prime)$. These models are trained to capture bidirectional transition information from the target domain dataset. In Phase II, we sample data from $D_{\rm src}$ and $D_{\rm tar}$ to train an offline RL agent, correcting the actions and rewards in each source domain transition tuple with the inverse policy model and the reward model. We further use the forward dynamics model to apply the correction selectively, so that source transitions better align with the target domain.
Figure 2: Action distribution comparison in (a) the hopper (gravity shift) and (b) the walker2d (morphology shift) environments. In each subplot, the left panel shows KDE curves comparing original source domain actions and target domain actions, while the right panel shows KDE curves comparing STC-corrected source actions with target actions.
Figure 3: Parameter study. We report target domain returns in two shift tasks. The shaded region captures the standard deviation.
Figure 4: Illustration of the adopted environments. Target domain robots differ from source domain robots (top) by gravity shifts (second row), friction shifts (third row), or morphology shifts (bottom).
Figure 5: Action distribution comparison in the ant environment with gravity shift. The left panel shows KDE curves comparing source domain actions and target domain actions, while the right panel shows KDE curves comparing STC-corrected source actions with target actions.
...and 1 more figure
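The KDE comparisons described in Figures 2 and 5 can be approximated with a short sketch. It is only an illustration under stated assumptions: the action samples are synthetic Gaussians rather than data from the paper, the helper `gaussian_kde_1d` is hypothetical, the bandwidth uses the normal-reference rule of thumb, and the integrated absolute gap between two density curves is one possible summary, not a metric reported by the authors.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a 1-D Gaussian kernel density estimate on `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Normal-reference rule of thumb for the bandwidth.
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# Synthetic one-dimensional action samples (assumed, not taken from the paper).
rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=2000)
source = rng.normal(1.0, 1.0, size=2000)       # shifted, like uncorrected source actions
corrected = rng.normal(0.05, 1.0, size=2000)   # nearly aligned, like corrected actions

grid = np.linspace(-5.0, 6.0, 400)
dx = grid[1] - grid[0]
p_target = gaussian_kde_1d(target, grid)

# Integrated absolute difference between density curves (Riemann sum).
gap_source = np.abs(gaussian_kde_1d(source, grid) - p_target).sum() * dx
gap_corrected = np.abs(gaussian_kde_1d(corrected, grid) - p_target).sum() * dx
```

A smaller gap for the corrected samples than for the original source samples corresponds to the visual overlap that the right-hand panels of those figures are meant to show.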

Theorems & Definitions (10)

Theorem 4.4
Theorem 4.5
Theorem 4.6: Finite data bound
Theorem 1.1
proof
Theorem 1.2
proof
Theorem 1.3: Finite data bound
proof
Lemma 1.1: Telescoping lemma
