Off-dynamics Conditional Diffusion Planners

Wen Zheng Terence Ng, Jianda Chen, Tianwei Zhang

TL;DR

This work uses conditional Diffusion Probabilistic Models (DPMs) to learn the joint distribution of a large-scale off-dynamics dataset and a limited target dataset. It also shows that modifying the context lets the model interpolate between source and target dynamics, making it more robust to subtle shifts in the environment.

Abstract

Offline Reinforcement Learning (RL) offers an attractive alternative to interactive data acquisition by leveraging pre-existing datasets. However, its effectiveness hinges on the quantity and quality of the data samples. This work explores the use of more readily available, albeit off-dynamics, datasets to address the challenge of data scarcity in Offline RL. We propose a novel approach using conditional Diffusion Probabilistic Models (DPMs) to learn the joint distribution of the large-scale off-dynamics dataset and the limited target dataset. To enable the model to capture the underlying dynamics structure, we introduce two contexts for the conditional model: (1) a continuous dynamics score that allows for partial overlap between trajectories from both datasets, providing the model with richer information; (2) an inverse-dynamics context that guides the model to generate trajectories that adhere to the target environment's dynamics constraints. Empirical results demonstrate that our method significantly outperforms several strong baselines. Ablation studies further reveal the critical role of each dynamics context. Additionally, we demonstrate that by modifying the context, the model can interpolate between source and target dynamics, making it more robust to subtle shifts in the environment.

Paper Structure

This paper contains 19 sections, 9 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: (Left) We utilize an accessible off-dynamics source dataset to enhance a limited target dataset for Offline RL. Our goal is to generate optimal trajectories within the green region. (Right) By conditioning a diffusion planner with our proposed continuous dynamics score, we enable the model to capture the underlying dynamics structure within the latent space through overlapping dynamics information.
  • Figure 2: Ablation: models trained with different contexts following Algorithm 1. We report the mean normalized score across different settings per environment. ('R', Orange) represents the base model conditioned only on the return. ('R+OH', LightBlue) adds the one-hot source/target label as a context to the base model. ('R+DS', Blue) adds the dynamics score as a context to the base model. ('R+DS+ID', Green) further adds the inverse-dynamics context. ('R+DS+IA', LightGreen) applies inverse action on 'R+DS'.
  • Figure 3: Plot of the generalisation capabilities for HalfCheetah. Models are trained on $\mathcal{D}_\text{source}$ ($m=14$) and $\mathcal{D}_\text{target}$ ($m=7$). Models are evaluated at interpolated masses $8 \leq m \leq 13$ and extrapolated masses $3 \leq m \leq 6$. Mean returns are shown instead of normalized scores because the normalizing score factors vary across masses.
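
The idea of modifying the context to move between source and target dynamics, evaluated in Figure 3 over a range of masses, can be illustrated with a minimal sketch. The mapping below is purely an assumption for illustration: it places the source dynamics ($m=14$) at context 0 and the target dynamics ($m=7$) at context 1 and interpolates linearly in mass. The function name `dynamics_context` and this linear convention are hypothetical and are not taken from the paper, which may define and scale its dynamics score differently.

```python
# Hypothetical sketch: map an evaluation mass to a scalar dynamics context.
# Assumption (not from the paper): source mass -> 0.0, target mass -> 1.0,
# with linear interpolation in between and linear extrapolation outside.

SOURCE_MASS = 14.0  # mass used for D_source in Figure 3
TARGET_MASS = 7.0   # mass used for D_target in Figure 3


def dynamics_context(mass, source_mass=SOURCE_MASS, target_mass=TARGET_MASS):
    """Return a scalar context that is 0 at the source mass and 1 at the target mass."""
    return (source_mass - mass) / (source_mass - target_mass)


# Masses evaluated in Figure 3: interpolated (8..13) and extrapolated (3..6).
interpolated = {m: dynamics_context(m) for m in range(8, 14)}
extrapolated = {m: dynamics_context(m) for m in range(3, 7)}

for m, c in sorted({**extrapolated, **interpolated}.items()):
    print(f"mass={m:2d}  context={c:.3f}")
```

Under this assumed convention, interpolated masses yield contexts strictly between 0 and 1, while extrapolated masses below the target yield contexts above 1, which is one way to see why extrapolation asks the conditional model to go beyond the range of contexts seen in training.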