Exploratory Diffusion Model for Unsupervised Reinforcement Learning

Chengyang Ying, Huayu Chen, Xinning Zhou, Zhongkai Hao, Hang Su, Jun Zhu

TL;DR

This work addresses unsupervised reinforcement learning (URL), where rewards are absent during pre-training and the goal is fast adaptation to downstream tasks. It introduces the Exploratory Diffusion Model (ExDM), which uses diffusion models to fit heterogeneous replay-buffer data and derives a score-based intrinsic reward $\mathcal{R}_{\mathrm{score}}$ to drive exploration, while employing a Gaussian behavior policy for efficient data collection. For downstream fine-tuning, ExDM adopts an alternating optimization framework that combines the KL-regularized objective $J_{\mathrm{f}}(\pi) = J(\pi) - \frac{\beta}{1-\gamma} \mathbb{E}_{s\sim d_{\pi}} [D_{\mathrm{KL}}(\pi(\cdot|s) \,\|\, \pi_{\mathrm{d}}(\cdot|s))]$ with energy-guided diffusion policy updates and IQL-based Q-function learning, complemented by diffusion-policy distillation via contrastive energy prediction (CEP) to enable efficient online refinement. Empirical results on Maze2d and URLB show state-of-the-art exploration efficiency and fast downstream adaptation, especially in structurally complex environments, supporting diffusion-based modeling as a practical approach to high-fidelity, reward-free pre-training. Overall, ExDM demonstrates that diffusion models can capture highly diverse exploration data and provide a principled path toward scalable unsupervised pre-training and rapid fine-tuning in RL.
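
The KL-regularized objective above has a well-known closed form in the simplest setting. For a single state with discrete actions, maximizing $\mathbb{E}_{a\sim\pi}[Q(s,a)] - \beta D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{d}})$ gives $\pi^*(a|s) \propto \pi_{\mathrm{d}}(a|s)\exp(Q(s,a)/\beta)$. The sketch below is a toy numerical check of that standard fact, not the paper's fine-tuning algorithm. The Q-values, the prior `pi_d`, and `beta` are made-up illustration values, and the occupancy weighting $\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\pi}}$ is ignored by looking at one state only.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def regularized_objective(pi, q_values, pi_d, beta):
    """Single-state objective: E_{a~pi}[Q(a)] - beta * KL(pi || pi_d)."""
    return float(np.dot(pi, q_values)) - beta * kl(pi, pi_d)

def closed_form_policy(q_values, pi_d, beta):
    """Return pi*(a) proportional to pi_d(a) * exp(Q(a) / beta)."""
    logits = np.log(pi_d) + q_values / beta
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical illustration values (not taken from the paper).
q_values = np.array([1.0, 0.5, -0.2, 0.0])
pi_d = np.array([0.4, 0.3, 0.2, 0.1])
beta = 0.5

pi_star = closed_form_policy(q_values, pi_d, beta)
best = regularized_objective(pi_star, q_values, pi_d, beta)

# Compare against random policies; none should beat the closed-form solution.
rng = np.random.default_rng(0)
candidates = np.clip(rng.dirichlet(np.ones(4), size=1000), 1e-12, None)
candidates /= candidates.sum(axis=1, keepdims=True)
worst_gap = min(
    best - regularized_objective(p, q_values, pi_d, beta) for p in candidates
)
```

In this toy setting the optimal value equals $\beta \log \sum_a \pi_{\mathrm{d}}(a)\exp(Q(a)/\beta)$, so a smaller `beta` lets the policy move further from `pi_d` toward the greedy action, while a larger `beta` keeps it close to the pre-trained prior.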

Abstract

Unsupervised reinforcement learning (URL) aims to pre-train agents by exploring diverse states or skills in reward-free environments, facilitating efficient adaptation to downstream tasks. As the agent cannot access extrinsic rewards during unsupervised exploration, existing methods design intrinsic rewards to model the explored data and encourage further exploration. However, the explored data are always heterogeneous, which demands strong representational capacity from both the intrinsic reward models and the pre-trained policies. In this work, we propose the Exploratory Diffusion Model (ExDM), which leverages the strong expressive ability of diffusion models to fit the explored data, simultaneously boosting exploration and providing an efficient initialization for downstream tasks. Specifically, ExDM uses the diffusion model to accurately estimate the distribution of the data collected in the replay buffer and introduces a score-based intrinsic reward that encourages the agent to explore less-visited states. After obtaining the pre-trained policies, ExDM enables rapid adaptation to downstream tasks. In detail, we provide theoretical analyses and practical algorithms for fine-tuning diffusion policies, addressing key challenges such as training instability and the computational cost of multi-step sampling. Extensive experiments demonstrate that ExDM outperforms existing SOTA baselines in both efficient unsupervised exploration and fast fine-tuning on downstream tasks, especially in structurally complicated environments.

Paper Structure

This paper contains 58 sections, 2 theorems, 36 equations, 11 figures, 4 tables, 2 algorithms.

Key Result

Theorem 3.1

When $\mathcal{S}, \mathcal{A}$ are discrete spaces, i.e., $|\mathcal{S}| = S, |\mathcal{A}| = A$, there are $M\triangleq A^S$ deterministic policies. Set $\hat{\pi} = \mathop{\mathrm{arg\,max}}\limits_{\pi} \mathcal{H}(d_{\pi}(\cdot))$. Under some mild assumptions, we have … which converges quickly to 1 as $A$ increases; here $v(S)$ is a constant that depends only on $S$ and satisfies $0 < v(S
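
The count $M = A^S$ in the theorem statement follows because a deterministic policy picks one of $A$ actions independently in each of the $S$ states. As a small sanity check, the sketch below enumerates every such policy for hypothetical sizes $S = 3$ and $A = 4$ (values chosen only to keep the enumeration cheap, not taken from the paper).

```python
from itertools import product

def deterministic_policies(num_states, num_actions):
    """Iterate over all deterministic policies; policy[s] is the action chosen in state s."""
    return product(range(num_actions), repeat=num_states)

# Small hypothetical sizes so the enumeration stays cheap.
S, A = 3, 4
policies = list(deterministic_policies(S, A))
count = len(policies)  # matches A ** S
```

The count grows exponentially in $S$, so exhaustive enumeration like this is only practical for very small state spaces.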

Figures (11)

  • Figure 1: Overview of Exploratory Diffusion Model (ExDM). Different from standard RL, URL aims to explore in reward-free environments, requiring expressive policies and models to fit heterogeneous data (Theorem 3.1). During pre-training, ExDM employs the diffusion model to model the heterogeneous exploration data and calculate score-based intrinsic rewards to encourage exploration. Moreover, we adopt a Gaussian behavior policy to collect data that avoids the inefficiency caused by the multi-step sampling of the diffusion policy.
  • Figure 2: Visualization of trajectories explored by different URL methods in the most complicated mazes. Full results of all 11 algorithms in 7 mazes are in the appendix.
  • Figure 3: State coverage ratios of different algorithms in 7 mazes during pre-training.
  • Figure 4: Aggregate metrics (Agarwal et al., 2021) in URLB fine-tuned by DDPG. Each statistic for every algorithm has 160 runs (4 domains × 4 downstream tasks × 10 seeds).
  • Figure 5: Aggregate metrics (Agarwal et al., 2021) in URLB of different fine-tuning methods for diffusion policies.
  • ...and 6 more figures
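
The aggregate metrics mentioned in the Figure 4 and Figure 5 captions are typically robust statistics computed over all runs, such as the interquartile mean (IQM) and the optimality gap. The sketch below shows one minimal way to compute these two over 160 runs (4 domains × 4 downstream tasks × 10 seeds). The scores are synthetic placeholders drawn at random, not results from the paper, and the function names and the target score of 1.0 are assumptions made for illustration.

```python
import numpy as np

def interquartile_mean(scores):
    """Mean of the middle 50% of the scores (lowest and highest 25% dropped)."""
    ordered = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = ordered.size // 4
    return float(ordered[cut: ordered.size - cut].mean())

def optimality_gap(scores, target=1.0):
    """Average shortfall below `target`; runs above the target contribute zero."""
    flat = np.asarray(scores, dtype=float).ravel()
    return float(np.maximum(target - flat, 0.0).mean())

# Placeholder data only: 4 domains x 4 tasks x 10 seeds of synthetic normalized scores.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.2, size=(4, 4, 10))

num_runs = scores.size          # 4 * 4 * 10 runs in total
iqm = interquartile_mean(scores)
gap = optimality_gap(scores)
```

IQM is less sensitive to outlier runs than the mean and uses more of the data than the median, which is why it is a common choice when only a handful of seeds per task are available.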

Theorems & Definitions (3)

  • Theorem 3.1: Details and proof are in the appendix
  • Theorem 3.2: Proof in the appendix
  • proof