DmC: Nearest Neighbor Guidance Diffusion Model for Offline Cross-domain Reinforcement Learning

Linh Le Pham Van, Minh Hoang Nguyen, Duc Kieu, Hung Le, Hung The Tran, Sunil Gupta

TL;DR

DmC is a framework for cross-domain offline RL with limited target samples. It uses $k$-nearest neighbor ($k$-NN) based estimation to measure domain proximity without neural network training, which mitigates overfitting, and it introduces a nearest-neighbor-guided diffusion model to generate additional source samples that are better aligned with the target domain.

Abstract

Cross-domain offline reinforcement learning (RL) seeks to enhance sample efficiency in offline RL by utilizing additional offline source datasets. A key challenge is to identify and utilize source samples that are most relevant to the target domain. Existing approaches address this challenge by measuring domain gaps through domain classifiers, target transition dynamics modeling, or mutual information estimation using contrastive loss. However, these methods often require large target datasets, which is impractical in many real-world scenarios. In this work, we address cross-domain offline RL under a limited target data setting, identifying two primary challenges: (1) Dataset imbalance, which is caused by large source and small target datasets and leads to overfitting in neural network-based domain gap estimators, resulting in uninformative measurements; and (2) Partial domain overlap, where only a subset of the source data is closely aligned with the target domain. To overcome these issues, we propose DmC, a novel framework for cross-domain offline RL with limited target samples. Specifically, DmC utilizes $k$-nearest neighbor ($k$-NN) based estimation to measure domain proximity without neural network training, effectively mitigating overfitting. Then, by utilizing this domain proximity, we introduce a nearest-neighbor-guided diffusion model to generate additional source samples that are better aligned with the target domain, thus enhancing policy learning with more effective source samples. Through theoretical analysis and extensive experiments in diverse MuJoCo environments, we demonstrate that DmC significantly outperforms state-of-the-art cross-domain offline RL methods, achieving substantial performance gains.
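
The $k$-NN proximity idea in the abstract can be illustrated with a minimal sketch. This section does not give the paper's exact estimator, so the score below (mean Euclidean distance from a source transition to its $k$ nearest target transitions) is an assumption, as are the function name `knn_proximity`, the flattened $(s, a, s')$ vector layout, and the toy values.

```python
import math


def knn_proximity(source, target, k=1):
    """Mean Euclidean distance from each source vector to its k nearest target vectors.

    Assumption: a smaller score means the source transition lies closer to the
    target data. This is an illustrative score, not necessarily the paper's estimator.
    """
    if not 1 <= k <= len(target):
        raise ValueError("k must be between 1 and len(target)")
    scores = []
    for x in source:
        # Brute-force search; no model is trained, so there is nothing to overfit.
        dists = sorted(math.dist(x, y) for y in target)
        scores.append(sum(dists[:k]) / k)
    return scores


# Toy transitions flattened as (s, a, s') vectors; the values are made up.
target = [(0.0, 0.0, 0.1), (0.1, 0.0, 0.2), (0.2, 0.1, 0.3)]
source = [(0.0, 0.0, 0.1), (0.1, 0.0, 0.9), (2.0, 2.0, 2.0)]

scores = knn_proximity(source, target, k=2)
# Indices of source samples ordered from closest to farthest from the target data.
ranked = sorted(range(len(source)), key=lambda i: scores[i])
```

Because the score is computed directly from distances, it remains usable when the target dataset holds only a handful of samples, which is the setting the abstract targets.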

Paper Structure

This paper contains 54 sections, 4 theorems, 40 equations, 8 figures, 14 tables, 1 algorithm.

Key Result

Theorem 1

Denote $D_{src}$ as the offline source dataset from source domain $\mathcal{M}_{src}$ and $D_{tar}$ as the offline target dataset from target domain $\mathcal{M}_{tar}$. Let the empirical policy in the offline target dataset $D_{tar}$ be $\pi_{D_{tar}} = \frac{\sum_{D_{tar}}\mathds{1}(s, a)}{\sum_{D

Figures (8)

  • Figure 1: Histogram of target-domain probabilities predicted for source samples in the Ant environment using the pretrained domain classifier (DARA). The classifier's predictions concentrate in a narrow range of probabilities, offering little information about the gap between the two domains.
  • Figure 2: Nearest Neighbor Distance histograms. Blue shows the distance of source samples to their nearest target samples. Pink shows the distance of target samples to their nearest target samples.
  • Figure 3: Illustration of our method. First, we use $k$-NN estimation to quantify the domain gap score. Next, we leverage a diffusion model to upsample the source data, generating samples close to the target domain. The datasets are then utilized in an offline RL framework, incorporating a regularization term to ensure the learned policy remains within the support region of the target dataset.
  • Figure 4: Estimated dynamics gaps between the target and different source variants. The DmC distribution is concentrated at lower gap values.
  • Figure 5: (Left) Histogram of the target likelihood predicted by the domain classifiers in DARA for the samples in the source dataset. (Middle) Histogram of the target likelihood predicted by the CVAE target dynamics model proposed in BOSA. (Right) Histogram of the reward penalty values computed using the domain classifiers in DARA.
  • ...and 3 more figures
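
The two distance sets described in the Figure 2 caption could be computed along the following lines. This is a sketch under assumptions: excluding each target sample from its own neighbor search (leave-one-out), the fixed-width binning, and the helper names are illustrative choices, not details taken from the paper.

```python
import math
from collections import Counter


def nn_dist_to_set(x, points):
    """Euclidean distance from x to the closest point in `points`."""
    return min(math.dist(x, p) for p in points)


def source_to_target(source, target):
    """Distance of each source sample to its nearest target sample."""
    return [nn_dist_to_set(x, target) for x in source]


def target_to_target(target):
    """Distance of each target sample to its nearest *other* target sample.

    Assumption: the sample itself is excluded by index (leave-one-out);
    otherwise every distance would be zero.
    """
    return [
        nn_dist_to_set(x, target[:i] + target[i + 1:])
        for i, x in enumerate(target)
    ]


def histogram(values, bin_width):
    """Count values per bin, keyed by bin index floor(value / bin_width)."""
    counts = Counter(int(v // bin_width) for v in values)
    return dict(sorted(counts.items()))


# One-dimensional toy data; the values are made up for illustration.
target = [(0.0,), (1.0,), (2.0,)]
source = [(0.5,), (5.0,)]

src_hist = histogram(source_to_target(source, target), bin_width=1.0)
tar_hist = histogram(target_to_target(target), bin_width=1.0)
```

Comparing the two histograms shows which source samples sit within the typical spacing of the target data and which lie far outside it.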

Theorems & Definitions (4)

  • Theorem 1
  • Lemma 2
  • Lemma 3
  • Theorem 4: Performance Bound