Bridging Dynamics Gaps via Diffusion Schrödinger Bridge for Cross-Domain Reinforcement Learning

Hanping Zhang, Yuhong Guo

TL;DR

This paper proposes Bridging Dynamics Gaps for Cross-Domain Reinforcement Learning (BDGxRL), a framework that uses a Diffusion Schrödinger Bridge to align source transitions with the target-domain dynamics encoded in offline demonstrations, and that adds a reward modulation mechanism which estimates rewards from state transitions.

Abstract

Cross-domain reinforcement learning (RL) aims to learn transferable policies under dynamics shifts between source and target domains. A key challenge is the lack of target-domain environment interaction and reward supervision, which prevents direct policy learning. To address this challenge, we propose Bridging Dynamics Gaps for Cross-Domain Reinforcement Learning (BDGxRL), a novel framework that leverages Diffusion Schrödinger Bridge (DSB) to align source transitions with target-domain dynamics encoded in offline demonstrations. Moreover, we introduce a reward modulation mechanism that estimates rewards from state transitions and is applied to DSB-aligned samples, so that rewards stay consistent with target-domain dynamics. BDGxRL performs target-oriented policy learning entirely within the source domain, without access to the target environment or its rewards. Experiments on MuJoCo cross-domain benchmarks demonstrate that BDGxRL outperforms state-of-the-art baselines and shows strong adaptability under transition dynamics shifts.

Paper Structure (31 sections, 3 theorems, 28 equations, 2 figures, 3 tables, 1 algorithm)


Key Result

Theorem 1

Assume the reward is bounded by $R_{\max}$, and the discount factor satisfies $\gamma \in [0,1)$. Let $\pi$ be the policy learned using BDGxRL with DSB-based dynamics translation and reward modulation, and let $\pi^\star$ denote the optimal policy in the target environment. Then, when the number of … where $\epsilon_m = \mathcal{O}\left( \frac{1}{N_S} + \frac{1}{N_T} \right)$ denotes the dynamics a…

Figures (2)

  • Figure 1: Overview of the proposed BDGxRL framework. The agent first collects a dataset $\mathcal{D}_S$ from the source environment, which is used to train a transition-aware reward model $R(s_t,s_{t+1})$. Together with offline target demonstrations $\mathcal{D}_T$, it also trains a DSB model for dynamics alignment. During online interactions, source transitions are translated into target-style transitions via $\tilde{s}_{t+1}\sim\mathrm{DSB}(s_t,a_t,s_{t+1})$ to mitigate dynamics mismatch. The modulated reward $\tilde{r}_t=R(s_t,\tilde{s}_{t+1})$ is then used to learn a target-oriented policy entirely within the source domain, initialized via imitation from $\mathcal{D}_T$.
  • Figure 2: Overall average performance of each method across all tasks, domain gaps, and demonstration levels.
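
The online step described in the Figure 1 caption (translate the observed next state, then score the translated transition) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_translator` is a hypothetical stand-in that merely shrinks the state change and is not a Schrödinger bridge, `toy_reward_model` is a hypothetical hand-written reward rather than a model trained on $\mathcal{D}_S$, and the function names and tuple-based state type are assumptions made for the example.

```python
from typing import Callable, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def toy_translator(s: State, a: Action, s_next: State) -> State:
    """Hypothetical stand-in for sampling s~_{t+1} ~ DSB(s_t, a_t, s_{t+1}).

    NOT a Diffusion Schrodinger Bridge: it only halves the state change
    to mimic a target domain with slower dynamics.
    """
    return tuple(si + 0.5 * (ni - si) for si, ni in zip(s, s_next))


def toy_reward_model(s: State, s_next: State) -> float:
    """Hypothetical transition-aware reward R(s_t, s_{t+1}).

    Assumed here to be forward progress along state dimension 0.
    """
    return s_next[0] - s[0]


def modulate(
    s: State,
    a: Action,
    s_next: State,
    translator: Callable[[State, Action, State], State],
    reward_model: Callable[[State, State], float],
) -> Tuple[State, float]:
    """Translate a source transition, then compute the modulated reward."""
    s_tilde_next = translator(s, a, s_next)   # target-style next state
    r_tilde = reward_model(s, s_tilde_next)   # r~_t = R(s_t, s~_{t+1})
    return s_tilde_next, r_tilde


# One source-domain transition (values chosen only for illustration).
s, a, s_next = (0.0, 1.0), (0.3,), (2.0, 1.0)
s_tilde_next, r_tilde = modulate(s, a, s_next, toy_translator, toy_reward_model)
```

The point of the sketch is the ordering: the reward is computed from the translated next state rather than from the raw source transition, which is what keeps it consistent with target-domain dynamics. The caption does not say what is written to the replay buffer, so the sketch simply returns both pieces.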

Theorems & Definitions (5)

  • Theorem 1: Policy Value Bound under DSB Translation
  • Lemma 1: Transition Model Error Bound
  • proof
  • Theorem 2: Policy Value Bound under DSB Translation
  • proof