
Contrastive Representation for Data Filtering in Cross-Domain Offline Reinforcement Learning

Xiaoyu Wen, Chenjia Bai, Kang Xu, Xudong Yu, Yang Zhang, Xuelong Li, Zhen Wang

TL;DR

This work tackles cross-domain offline reinforcement learning, where source-domain dynamics differ from those of the target and naively merging the two datasets degrades performance. It introduces the mutual-information (MI) gap ΔI, the difference between the target and source domains in the mutual information between state-action pairs and next states, as a robust measure of domain discrepancy, and derives a contrastive objective that estimates ΔI directly via InfoNCE, using target-domain transitions as positives and source-domain transitions as negatives. The method, IGDF, uses learned encoders to score source transitions and shares only the most compatible ones with the target, optionally weighting TD errors by the learned score; a performance bound shows how reducing ΔI tightens the gap between performance on the target data and on the shared data. Empirically, IGDF combined with an offline RL backbone outperforms state-of-the-art baselines across diverse MuJoCo dynamics-shift tasks, achieving superior data efficiency (e.g., using 10% of the target data to reach nearly 90% of full-target performance) and remaining robust under large domain gaps. The approach offers a practical, theoretically grounded way to exploit cross-domain data in offline RL with broad applicability and minimal tuning.

Abstract

Cross-domain offline reinforcement learning leverages source domain data with diverse transition dynamics to alleviate the data requirement for the target domain. However, simply merging the data of two domains leads to performance degradation due to the dynamics mismatch. Existing methods address this problem by measuring the dynamics gap via domain classifiers while relying on the assumptions of the transferability of paired domains. In this paper, we propose a novel representation-based approach to measure the domain gap, where the representation is learned through a contrastive objective by sampling transitions from different domains. We show that such an objective recovers the mutual-information gap of transition functions in two domains without suffering from the unbounded issue of the dynamics gap in handling significantly different domains. Based on the representations, we introduce a data filtering algorithm that selectively shares transitions from the source domain according to the contrastive score functions. Empirical results on various tasks demonstrate that our method achieves superior performance, using only 10% of the target data to achieve 89.2% of the performance on 100% target dataset with state-of-the-art methods.
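The filtering step described in the abstract can be pictured with a minimal sketch. It assumes a score function is already available; here a toy stand-in (`toy_score`, which rewards agreement with made-up target dynamics `s' = s + a`) replaces the learned contrastive score, and the names `filter_source_transitions` and `keep_fraction` are hypothetical, not taken from the paper.

```python
def filter_source_transitions(source_transitions, score_fn, keep_fraction):
    """Keep the highest-scoring fraction of source transitions.

    source_transitions: list of (state, action, next_state) tuples.
    score_fn: callable returning a float; higher means "more target-like".
    keep_fraction: hypothetical knob in [0, 1] for how much source data to share.
    Returns a list of (score, transition) pairs, best first.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    scored = [(score_fn(t), t) for t in source_transitions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = int(len(scored) * keep_fraction)
    return scored[:n_keep]


def toy_score(transition):
    # Stand-in for the learned score: assumes toy target dynamics s' = s + a.
    s, a, s_next = transition
    return -abs(s_next - (s + a))


toy_source = [
    (0.0, 1.0, 1.0),   # matches the toy target dynamics exactly
    (0.0, 1.0, 3.0),   # large mismatch
    (1.0, 1.0, 2.1),   # small mismatch
    (2.0, -1.0, 0.0),  # moderate mismatch
]
shared = filter_source_transitions(toy_source, toy_score, keep_fraction=0.5)
shared_transitions = [t for _, t in shared]
```

The retained scores are returned alongside the transitions so that a downstream learner could, if desired, also use them as per-sample weights; how the paper weights TD errors is not specified in this summary.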

Paper Structure (41 sections, 6 theorems, 41 equations, 8 figures, 10 tables, 2 algorithms)


Key Result

Theorem 3.1

The MI gap $\Delta I = I_{\rm tar}([S,A];S') - I_{\rm src}([S,A];S')$ can be lower bounded by the negative of the contrastive objective, in which $K-1$ is the number of negative samples drawn from the source domain.
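To make the shape of such an objective concrete, here is a sketch of a standard InfoNCE loss with one target-domain positive and $K-1$ source-domain negatives. It is not the paper's exact objective: the dot-product score between a state-action embedding and a next-state embedding, and all function and variable names, are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(u, v))


def info_nce_loss(sa_embedding, pos_next_embedding, neg_next_embeddings):
    """Standard InfoNCE loss for a single anchor.

    sa_embedding: embedding of the (state, action) pair (assumed given).
    pos_next_embedding: embedding of the target-domain next state (positive).
    neg_next_embeddings: K-1 embeddings of source-domain next states (negatives).
    Returns -log( exp(f_pos) / sum_k exp(f_k) ) with f = dot-product score.
    """
    logits = [dot(sa_embedding, pos_next_embedding)]
    logits += [dot(sa_embedding, neg) for neg in neg_next_embeddings]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy embeddings: the positive aligns with the anchor, the negatives do not.
phi_sa = [1.0, 0.0]
psi_target = [2.0, 0.0]
psi_source = [[0.0, 1.0], [-1.0, 0.0]]  # K - 1 = 2 negatives
loss = info_nce_loss(phi_sa, psi_target, psi_source)
```

When the encoders cannot tell the positive from the negatives (all scores equal), this loss equals $\log K$; it falls toward zero as the positive's score dominates.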

Figures (8)

  • Figure 1: (a) Comparison of performance across five seeds in nine MuJoCo tasks (ha: halfcheetah, ho: hopper, wa: walker2d, m: medium, mr: medium-replay, me: medium-expert) with IQL. We use the standard D4RL datasets as the target-domain data. For the source domain, we modify environmental parameters, such as altering body mass or introducing joint noise, and then collect offline datasets in the modified environments. (IQL-100%: use 100% target data, IQL-10%: use reduced 10% target data, IQL-Mix: use 10% target data and 100% source data.) (b) Comparison of performance between our algorithm and DARA with the 100% source-domain dataset and the 10% target-domain dataset in Hopper-Medium-v2 under an increasing dynamics gap. Specifically, we simulate increasing dynamics gaps by continuously increasing the head size in the Hopper-v2 environment. The x-axis is the head size of the Hopper-v2 (normal size is 0.05), and "Org Score" is the original performance of IQL when using 100% target data.
  • Figure 2: An illustration of the MI gap of data shared from ${\mathcal{D}}_{\rm src}$.
  • Figure 3: Illustration of our method. (a) We train two encoder networks using contrastive learning, treating target transitions as positive examples and constructed transitions as negative examples. (b) We tackle cross-domain offline RL by selectively sharing the source-domain data according to the score functions. The target data and the shared data are then used by offline RL algorithms to learn the policy.
  • Figure 4: Sensitivity on the amount of target-domain data.
  • Figure 5: Sensitivity on the importance coefficient.
  • ...and 3 more figures

Theorems & Definitions (9)

  • Theorem 3.1: InfoNCE extension
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 1.1
  • proof
  • Theorem 1.2
  • proof
  • Theorem 1.3
  • proof