Learning from Random Demonstrations: Offline Reinforcement Learning with Importance-Sampled Diffusion Models
Zeyu Fang, Tian Lan
TL;DR
ADEPT tackles offline reinforcement learning under distribution shift with a closed-loop framework that pairs policy evaluation in a guided diffusion world model with an importance-sampled update that keeps the model aligned with the evolving policy. It provides a theoretical bound on the return gap between the real environment and the diffusion-based training environment, expressed in terms of reward, transition, and policy-shift errors. Empirically, ADEPT yields strong improvements over state-of-the-art baselines on D4RL MuJoCo tasks, especially on random and medium-replay datasets, which demonstrates the value of adaptive diffusion and policy-guided evaluation in offline settings. The approach shows how a diffusion model can serve as a controllable world-model component that mitigates distribution shift in offline RL, though it has limitations in more complex or partially observable environments.
Abstract
Generative models such as diffusion models have been employed as world models in offline reinforcement learning to generate synthetic data for more effective learning. Existing work either trains the diffusion model once, prior to training, or requires additional interaction data to update it. In this paper, we propose a novel approach for offline reinforcement learning with closed-loop policy evaluation and world-model adaptation. It iteratively leverages a guided diffusion world model to directly evaluate the offline target policy with actions drawn from it, and then performs an importance-sampled world model update to adaptively align the world model with the updated policy. We analyze the performance of the proposed method and provide an upper bound on the return gap between our method and the real environment under an optimal policy. The result sheds light on various factors affecting learning performance. Evaluations in the D4RL environment show significant improvement over state-of-the-art baselines, especially when only random or medium-expertise demonstrations are available, a setting that requires improved alignment between the world model and offline policy evaluation.
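
To make the idea of an importance-sampled world model update more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each offline transition comes with a log-probability of its action under the current target policy and under the data-collecting policy, that the weights are the clipped, self-normalized ratio of the two, and that the world model is trained with a noise-prediction squared error. The function names, the clipping threshold, and the toy numbers are all hypothetical; the abstract does not specify the exact weighting scheme or loss.

```python
import math


def importance_weights(logp_target, logp_behavior, clip=10.0):
    """Clipped, self-normalized weights pi_target(a|s) / pi_behavior(a|s).

    Assumption: both log-probabilities are available per transition.
    The clip value is an illustrative choice, not taken from the paper.
    """
    ratios = [
        min(math.exp(lt - lb), clip)
        for lt, lb in zip(logp_target, logp_behavior)
    ]
    total = sum(ratios)
    return [r / total for r in ratios]


def weighted_denoising_loss(pred_noise, true_noise, weights):
    """Importance-weighted mean squared error between predicted and true noise."""
    loss = 0.0
    for w, pred, true in zip(weights, pred_noise, true_noise):
        per_sample = sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)
        loss += w * per_sample
    return loss


# Toy batch of two transitions (hypothetical numbers).
logp_target = [math.log(0.5), math.log(0.1)]      # current policy favors sample 0
logp_behavior = [math.log(0.25), math.log(0.25)]  # data policy was uniform-like
weights = importance_weights(logp_target, logp_behavior)

pred_noise = [[0.0, 0.0], [1.0, 1.0]]  # stand-in for the model's noise prediction
true_noise = [[1.0, 1.0], [1.0, 3.0]]  # noise actually added in the forward process
loss = weighted_denoising_loss(pred_noise, true_noise, weights)
print(weights, loss)
```

In this sketch, transitions whose actions the updated policy is more likely to take contribute more to the world model loss, which is one simple way to bias the model toward the state-action region that the policy will query during evaluation.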
