Off-policy Reinforcement Learning with Model-based Exploration Augmentation

Likun Wang; Xiangteng Zhang; Yinuo Wang; Guojian Zhan; Wenxuan Wang; Haoyu Gao; Jingliang Duan; Shengbo Eben Li

TL;DR

This work introduces MoGE, a modular off-policy RL augmentation that uses a diffusion-based generator to create critical states and a one-step imagination world model to synthesize dynamics-consistent transitions, accelerating policy learning without altering core RL algorithms. By aligning the generator with a steady occupancy measure and guiding generation with policy- and value-driven utilities (e.g., policy entropy, TD error), MoGE continually expands exploration into high-potential regions while preserving Bellman consistency through the world model. The authors provide theoretical guarantees on occupancy alignment, a concrete off-policy training framework with mixture sampling, and extensive experiments showing significant improvements in sample efficiency and final performance on OpenAI Gym and DeepMind Control Suite. Overall, MoGE offers a practical, plug-in approach to exploration augmentation that couples task-aware state generation with reliable dynamics, enabling more efficient learning in complex control tasks.
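To make the two utilities named above concrete, here is a minimal sketch of how a TD error and a policy entropy could be computed and used to score candidate states. It is an illustration only. The function names, the diagonal-Gaussian policy, and the top-k ranking are assumptions introduced here; in MoGE the utility guides a diffusion-based generator rather than ranking a fixed candidate list.

```python
import math


def gaussian_entropy(log_stds):
    """Differential entropy of a diagonal Gaussian with the given log-stds.

    Assumption: the policy is a diagonal Gaussian; the source does not
    specify the policy class.
    """
    d = len(log_stds)
    return 0.5 * d * (1.0 + math.log(2.0 * math.pi)) + sum(log_stds)


def td_error(reward, gamma, q_sa, q_next, done):
    """One-step TD error: r + gamma * (1 - done) * Q(s', a') - Q(s, a)."""
    bootstrap = 0.0 if done else gamma * q_next
    return reward + bootstrap - q_sa


def rank_by_utility(candidates, scores, k):
    """Return the k candidates with the highest utility score.

    Hypothetical stand-in for utility-guided generation: it only ranks
    states that already exist.
    """
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:k]]


# Score three toy states by the magnitude of their TD error.
states = ["s0", "s1", "s2"]
scores = [
    abs(td_error(1.0, 0.99, 2.0, 3.0, False)),
    abs(td_error(0.0, 0.99, 0.5, 0.5, False)),
    abs(td_error(1.0, 0.99, 2.0, 3.0, True)),
]
critical = rank_by_utility(states, scores, k=2)
```

States with a large absolute TD error are those where the critic is most inconsistent with its own bootstrap target, and high entropy marks states where the policy is least committed. Both are plausible signals of where additional experience may help most.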

Abstract

Exploration is fundamental to reinforcement learning (RL), as it determines how effectively an agent discovers and exploits the underlying structure of its environment to achieve optimal performance. Existing exploration methods generally fall into two categories: active exploration and passive exploration. The former introduces stochasticity into the policy but struggles in high-dimensional environments, while the latter adaptively prioritizes transitions in the replay buffer to enhance exploration, yet remains constrained by limited sample diversity. To address this limitation of passive exploration, we propose Modelic Generative Exploration (MoGE), which augments exploration through the generation of under-explored critical states and the synthesis of dynamics-consistent experiences through transition models. MoGE is composed of two components: (1) a diffusion-based generator that synthesizes critical states under the guidance of a utility function evaluating each state's potential influence on policy exploration, and (2) a one-step imagination world model that constructs critical transitions from these critical states for agent learning. Our method adopts a modular formulation that aligns with the principles of off-policy learning, allowing seamless integration with existing algorithms to improve exploration without altering their core structures. Empirical results on OpenAI Gym and DeepMind Control Suite show that MoGE effectively bridges exploration and policy learning, leading to gains in both sample efficiency and performance across complex control tasks.
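The two-component pipeline described in the abstract can be sketched as follows: generated states are turned into full transitions by a one-step model rollout, then mixed with real replay data in each training batch. This is a toy sketch under stated assumptions, not the authors' implementation. The names `imagine_transition`, `mixed_batch`, and `gen_ratio` are hypothetical, as are the scalar states and the hand-written policy and world model.

```python
import random


def imagine_transition(state, policy, world_model):
    """Build one transition (s, a, r, s') from a generated state.

    The world model is queried for a single step only, so model error
    does not compound over a long rollout.
    """
    action = policy(state)
    next_state, reward = world_model(state, action)
    return (state, action, reward, next_state)


def mixed_batch(replay_buffer, generated_states, policy, world_model,
                batch_size, gen_ratio, rng):
    """Sample a batch that mixes real replay transitions with imagined ones.

    gen_ratio is a hypothetical knob: the fraction of the batch built
    from generated states.
    """
    n_gen = int(round(batch_size * gen_ratio))
    n_real = batch_size - n_gen
    real = [rng.choice(replay_buffer) for _ in range(n_real)]
    picked = [rng.choice(generated_states) for _ in range(n_gen)]
    imagined = [imagine_transition(s, policy, world_model) for s in picked]
    return real + imagined


# Toy stand-ins (assumptions): 1-D state, linear policy, known dynamics.
def toy_policy(s):
    return -0.5 * s


def toy_world_model(s, a):
    s_next = s + a
    return s_next, -abs(s_next)


rng = random.Random(0)
replay = [(0.0, 0.1, -0.1, 0.1)]   # real (s, a, r, s') tuples
generated = [2.0]                  # states proposed by a generator
batch = mixed_batch(replay, generated, toy_policy, toy_world_model,
                    batch_size=4, gen_ratio=0.25, rng=rng)
```

Because the mixed batch has the same `(s, a, r, s')` layout as ordinary replay data, an off-policy learner can consume it without any change to its update rule, which is the modularity the abstract emphasizes.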
