
Learning from Random Demonstrations: Offline Reinforcement Learning with Importance-Sampled Diffusion Models

Zeyu Fang, Tian Lan

TL;DR

ADEPT tackles offline reinforcement learning under distribution shift by introducing a closed-loop framework that pairs policy evaluation with a guided diffusion world model and an importance-sampled update to align the model with the evolving policy. It provides a theoretical bound on the return gap between the real environment and the diffusion-based training environment, expressed in terms of reward, transition, and policy-shift errors. Empirically, ADEPT yields strong improvements over state-of-the-art baselines on D4RL MuJoCo tasks, especially on random and medium-replay datasets, demonstrating the value of adaptive diffusion and policy-guided evaluation in offline settings. The approach offers practical benefits for leveraging diffusion models as a controllable world-model component to mitigate distribution shift in offline RL, albeit with limitations in more complex or partially observable environments.

Abstract

Generative models such as diffusion models have been employed as world models in offline reinforcement learning to generate synthetic data for more effective learning. Existing work either trains the diffusion model once before policy training or requires additional interaction data to update it. In this paper, we propose a novel approach for offline reinforcement learning with closed-loop policy evaluation and world-model adaptation. It iteratively leverages a guided diffusion world model to directly evaluate the offline target policy with actions drawn from it, and then performs an importance-sampled world-model update to adaptively align the world model with the updated policy. We analyze the performance of the proposed method and provide an upper bound on the return gap between our method and the real environment under an optimal policy. The result sheds light on various factors affecting learning performance. Evaluations in the D4RL environment show significant improvement over state-of-the-art baselines, especially when only random or medium-expertise demonstrations are available, a setting that requires improved alignment between the world model and offline policy evaluation.
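The importance-sampled world-model update described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes 1-D Gaussian policies with a shared standard deviation, a clipped and self-normalized likelihood ratio, and hypothetical helper names (`gaussian_pdf`, `importance_weights`, `weighted_loss`). The idea shown is only that offline samples whose actions are more likely under the updated policy receive a larger weight in the world-model loss.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; stands in for a policy density pi(a|s)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weights(actions, target_means, behavior_means, std=1.0, clip=10.0):
    """Clipped, self-normalized ratios pi_target(a|s) / pi_behavior(a|s).

    Assumption: both policies are Gaussians with the same std, and the
    per-state means are given directly. Clipping and self-normalization are
    common variance-reduction choices, not details taken from the paper.
    """
    ratios = []
    for a, mu_t, mu_b in zip(actions, target_means, behavior_means):
        ratio = gaussian_pdf(a, mu_t, std) / gaussian_pdf(a, mu_b, std)
        ratios.append(min(ratio, clip))
    total = sum(ratios)
    return [r / total for r in ratios]


def weighted_loss(per_sample_losses, weights):
    """Importance-weighted average of per-sample world-model losses."""
    return sum(w * l for w, l in zip(weights, per_sample_losses))
```

For example, with two offline actions `[0.0, 1.0]`, a behavior policy centered at 0, and an updated policy centered at 1, the second sample gets the larger weight, so its denoising loss contributes more to the world-model update.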

Paper Structure

This paper contains 22 sections, 8 theorems, 35 equations, 3 figures, 4 tables, 1 algorithm.

Key Result

Theorem 4.4

Given $\hat{\varepsilon}_r$, $\hat{\varepsilon}_m$ and $\hat{\varepsilon}_\pi$, the bound $C$ between the true return and the IS-diffusion model-based return can be expressed as follows:

Figures (3)

  • Figure 1: Existing algorithms, shown in (a), use a fixed world model in offline RL (or require additional data from the environment to update it). Our proposed approach, illustrated in (b1) to (b3), adapts the diffusion model from $\theta_k$ to $\theta_{k+1}$ using importance sampling with respect to the updated policy $\pi_{k+1}$, and then evaluates the current policy $\pi_{k+1}$ with guided diffusion to obtain $\pi_{k+2}$. Each point is a sample in the offline dataset; darker points receive higher weight in the loss for the diffusion world-model update. The performance of the proposed method is analyzed in this paper.
  • Figure 2: An overview of our ADEPT algorithm. It iteratively leverages a guided diffusion world model to directly evaluate the target policy with actions drawn from it, and then performs an importance-sampled world model update to adaptively align the world model with the updated policy. An upper bound on the return gap is analyzed in Section 4.3.
  • Figure 3: Training curves of different methods on the halfcheetah environment for our ablation study. Diff+IQL and Diff+SAC generate synthetic data without importance sampling, using a diffusion model trained once before policy training. IQL and SAC further remove all synthetic data. All other settings are the same as in the ADEPT methods. The results demonstrate the critical role of diffusion world-model adaptation in offline RL.
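
The alternation described in Figures 1 and 2 (adapt the world model to the current policy, then improve the policy using the adapted model) can be sketched as a generic loop. This is a structural sketch only. The function name `closed_loop` and the callables `update_model` and `improve_policy` are hypothetical placeholders for the importance-sampled diffusion update and the guided-diffusion policy evaluation step, whose details are not given in this summary.

```python
def closed_loop(policy, model, dataset, update_model, improve_policy, num_iters):
    """Alternate world-model adaptation and policy improvement.

    update_model(model, policy, dataset) -> new model
        Placeholder for the importance-sampled world-model update
        (theta_k -> theta_{k+1}, aligned with the current policy).
    improve_policy(policy, model, dataset) -> new policy
        Placeholder for policy evaluation and improvement using the
        adapted world model (pi_{k+1} -> pi_{k+2}).
    """
    history = []
    for _ in range(num_iters):
        # Step 1: align the world model with the current policy.
        model = update_model(model, policy, dataset)
        # Step 2: evaluate and improve the policy with the adapted model.
        policy = improve_policy(policy, model, dataset)
        history.append((policy, model))
    return policy, model, history
```

Passing the two steps as callables keeps the sketch independent of any particular diffusion or RL library. In the Figure 3 ablation terms, a fixed-world-model baseline corresponds to an `update_model` that returns the model unchanged.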

Theorems & Definitions (16)

  • Definition 4.1
  • Definition 4.2
  • Definition 4.3
  • Theorem 4.4
  • proof
  • Lemma 4.5
  • Lemma 4.6
  • Lemma 4.7
  • Lemma A.1
  • proof
  • ...and 6 more