Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts

Zhongjian Qiao, Rui Yang, Jiafei Lyu, Xiu Li, Zhongxiang Dai, Zhuoran Yang, Siyang Gao, Shuang Qiu

TL;DR

This work addresses the vulnerability of cross-domain offline RL to both train-time and test-time dynamics shifts. It introduces the robust cross-domain Bellman (RCB) operator and the DROCO algorithm, incorporating a dynamic value penalty and a Huber loss to mitigate value estimation errors. Theoretical results establish contraction and dual robustness properties, while extensive experiments across kinematic and morphology shifts demonstrate improved performance and strengthened test-time robustness over strong baselines. The approach provides a practical pathway for deploying cross-domain offline policies in non-stationary real-world environments with limited target-domain data.

Abstract

Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative to the out-of-distribution dynamics transitions, thus guaranteeing the train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical Dual-RObust Cross-domain Offline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.
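
As an illustration of one of the two techniques named in the abstract, the following is a minimal sketch of the standard Huber loss applied to temporal-difference errors. It is not taken from the paper: the function name `huber_loss`, the threshold name `delta`, and the sample error values are assumptions chosen for the example, and how DROCO applies the loss is not specified in this summary.

```python
import numpy as np

def huber_loss(td_error, delta=1.0):
    """Standard Huber loss (hypothetical helper, not the paper's code).

    Quadratic for |error| <= delta, linear beyond it, so large errors
    contribute less than they would under a squared loss.
    """
    abs_err = np.abs(td_error)
    quadratic = 0.5 * td_error ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)

# Illustrative TD errors (made-up values): small errors are penalized
# quadratically, large errors only linearly.
errors = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
losses = huber_loss(errors, delta=1.0)
squared = 0.5 * errors ** 2
```

For the large errors in this example the Huber value is strictly smaller than the squared-loss value, which is the usual motivation for using it when value targets may be badly over- or underestimated.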

Paper Structure

This paper contains 41 sections, 8 theorems, 75 equations, 13 figures, 12 tables, 1 algorithm.

Key Result

Proposition 4.1

The RCB operator is a $\gamma$-contraction operator in the complete state-action space $(\mathbb{R}^{|\mathcal{S}\times\mathcal{A}|},\|\cdot\|_\infty)$ where $\|\cdot\|_\infty$ denotes the $\ell_\infty$ norm, i.e., $\|\mathcal{T}_{\text{RCB}}Q_1-\mathcal{T}_{\text{RCB}}Q_2\|_\infty\leq \gamma\|Q_1-Q_2\|_\infty$.
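
The definition of the RCB operator is not reproduced in this summary, so the contraction property cannot be checked on it directly. As a hedged illustration of what a $\gamma$-contraction in the $\ell_\infty$ norm means, the sketch below numerically checks the inequality for a generic worst-case (robust) Bellman operator over a finite set of transition kernels. This stand-in operator, the function name `robust_bellman`, and the randomly generated tabular MDP are all assumptions made for the example and are not the paper's construction.

```python
import numpy as np

def robust_bellman(Q, rewards, kernels, gamma):
    """Generic worst-case Bellman backup (stand-in, NOT the RCB operator).

    Q, rewards: arrays of shape (S, A).
    kernels: array of shape (K, S, A, S), K candidate transition kernels.
    """
    v = Q.max(axis=1)                     # greedy state values, shape (S,)
    next_values = kernels @ v             # expected next value, shape (K, S, A)
    worst_case = next_values.min(axis=0)  # pessimistic over kernels, (S, A)
    return rewards + gamma * worst_case

# Random tabular problem (all sizes and values are illustrative).
rng = np.random.default_rng(0)
S, A, K, gamma = 5, 3, 4, 0.9
rewards = rng.uniform(size=(S, A))
kernels = rng.dirichlet(np.ones(S), size=(K, S, A))  # rows sum to 1

Q1 = rng.normal(size=(S, A))
Q2 = rng.normal(size=(S, A))

# Left- and right-hand sides of the contraction inequality.
lhs = np.abs(robust_bellman(Q1, rewards, kernels, gamma)
             - robust_bellman(Q2, rewards, kernels, gamma)).max()
rhs = gamma * np.abs(Q1 - Q2).max()
contraction_holds = bool(lhs <= rhs + 1e-12)
```

A single random pair $(Q_1, Q_2)$ is only a sanity check, not a proof. The practical consequence of the contraction property is that repeated application of the operator converges to a unique fixed point.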

Figures (13)

  • Figure 1: Performance comparison with different dataset sizes under dynamics perturbations.
  • Figure 2: Evaluation results under different types and levels of dynamics perturbations.
  • Figure 3: Parameter sensitivity experiments on $\beta$ and $\delta$.
  • Figure 4: Visualization of the target domains and source domains with kinematic shifts and morphology shifts, across four tasks (halfcheetah, hopper, walker2d, ant).
  • Figure 5: Evaluation results of IGDF under morphology and min Q perturbations with different sizes of target domain data.
  • ...and 8 more figures

Theorems & Definitions (17)

  • Definition 4.1: RCB operator
  • Proposition 4.1: $\gamma$-contraction
  • Proposition 4.2: Dual Reformulation
  • Definition 4.2: Practical RCB operator
  • Proposition 4.3: $\gamma$-contraction
  • Proposition 4.4: Train-time robustness against dynamics shifts
  • Proposition 4.5: Test-time robustness against dynamics shifts
  • Proposition 4.6: Limited overestimation
  • proof
  • Lemma B.1
  • ...and 7 more