Off-policy Reinforcement Learning with Model-based Exploration Augmentation

Likun Wang, Xiangteng Zhang, Yinuo Wang, Guojian Zhan, Wenxuan Wang, Haoyu Gao, Jingliang Duan, Shengbo Eben Li

TL;DR

This work introduces MoGE, a modular off-policy RL augmentation that uses a diffusion-based generator to create critical states and a one-step imagination world model to synthesize dynamics-consistent transitions, accelerating policy learning without altering core RL algorithms. By aligning the generator with a steady occupancy measure and guiding generation with policy- and value-driven utilities (e.g., policy entropy, TD error), MoGE continually expands exploration into high-potential regions while preserving Bellman consistency through the world model. The authors provide theoretical guarantees on occupancy alignment, a concrete off-policy training framework with mixture sampling, and extensive experiments showing significant improvements in sample efficiency and final performance on OpenAI Gym and DeepMind Control Suite. Overall, MoGE offers a practical, plug-in approach to exploration augmentation that couples task-aware state generation with reliable dynamics, enabling more efficient learning in complex control tasks.
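As a rough illustration of the policy- and value-driven utilities mentioned above, the sketch below scores candidate states by combining the entropy of a diagonal Gaussian policy with the absolute TD error. The function names, the additive combination, and the weights `w_entropy` and `w_td` are assumptions made for this example. The summary does not specify how MoGE combines these signals or how they steer the diffusion generator, so this only shows how such scores might be computed.

```python
import numpy as np


def gaussian_entropy(log_std):
    """Entropy of a diagonal Gaussian policy for each state.

    log_std: array of shape (batch, act_dim) holding per-dimension log std.
    """
    act_dim = log_std.shape[-1]
    return 0.5 * act_dim * (1.0 + np.log(2.0 * np.pi)) + log_std.sum(axis=-1)


def td_error(reward, q_sa, q_next, gamma=0.99, done=None):
    """Absolute one-step TD error |r + gamma * (1 - done) * Q(s', a') - Q(s, a)|."""
    done = np.zeros_like(reward) if done is None else done
    return np.abs(reward + gamma * (1.0 - done) * q_next - q_sa)


def utility(entropy, td, w_entropy=1.0, w_td=1.0):
    """Hypothetical utility: weighted sum of entropy and TD error (an assumption)."""
    return w_entropy * entropy + w_td * td
```

Under this assumed scoring, states with a higher value would be treated as more promising candidates for exploration; the weights trade off policy uncertainty against value-estimate error.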

Abstract

Exploration is fundamental to reinforcement learning (RL), as it determines how effectively an agent discovers and exploits the underlying structure of its environment to achieve optimal performance. Existing exploration methods generally fall into two categories: active exploration and passive exploration. The former introduces stochasticity into the policy but struggles in high-dimensional environments, while the latter adaptively prioritizes transitions in the replay buffer to enhance exploration, yet remains constrained by limited sample diversity. To address the limitation in passive exploration, we propose Modelic Generative Exploration (MoGE), which augments exploration through the generation of under-explored critical states and synthesis of dynamics-consistent experiences through transition models. MoGE is composed of two components: (1) a diffusion-based generator that synthesizes critical states under the guidance of a utility function evaluating each state's potential influence on policy exploration, and (2) a one-step imagination world model for constructing critical transitions based on the critical states for agent learning. Our method adopts a modular formulation that aligns with the principles of off-policy learning, allowing seamless integration with existing algorithms to improve exploration without altering their core structures. Empirical results on OpenAI Gym and DeepMind Control Suite reveal that MoGE effectively bridges exploration and policy learning, leading to remarkable gains in both sample efficiency and performance across complex control tasks.
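The two-component pipeline described in the abstract can be sketched as follows: generated critical states are turned into one-step transitions by a world model, then mixed with real replay-buffer samples to form a training batch. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `one_step_imagine` and `mixed_batch`, the `ratio` parameter, and the callable interfaces for `policy` and `world_model` are all hypothetical, and the generator is assumed to have already produced `generated_states`.

```python
import numpy as np


def one_step_imagine(states, policy, world_model):
    """Build (s, a, r, s') transitions from generated states.

    policy(states) -> actions; world_model(states, actions) -> (next_states, rewards).
    Both callables are placeholders for learned models.
    """
    actions = policy(states)
    next_states, rewards = world_model(states, actions)
    return states, actions, rewards, next_states


def mixed_batch(real_batch, generated_states, policy, world_model, ratio, rng):
    """Replace a fraction `ratio` of a real batch with imagined transitions.

    real_batch: tuple (s, a, r, s') of arrays sharing the same first dimension.
    generated_states must contain at least round(ratio * batch_size) rows.
    """
    batch_size = real_batch[0].shape[0]
    n_gen = int(round(ratio * batch_size))
    n_real = batch_size - n_gen
    # Sample without replacement from the real batch and the generated states.
    real_idx = rng.choice(batch_size, size=n_real, replace=False)
    gen_idx = rng.choice(generated_states.shape[0], size=n_gen, replace=False)
    imagined = one_step_imagine(generated_states[gen_idx], policy, world_model)
    # Real transitions come first, imagined transitions last.
    return tuple(
        np.concatenate([real[real_idx], gen], axis=0)
        for real, gen in zip(real_batch, imagined)
    )
```

Because the mixed batch has the same layout as an ordinary replay batch, an off-policy learner could consume it without changes to its update rule, which is the sense in which the augmentation is modular.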

Paper Structure

This paper contains 35 sections, 5 theorems, 43 equations, 12 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Let $\beta(a \mid s)$ be a behavior policy, $\pi^{*}(a \mid s)$ be a specific static policy, and let $\nu_{t}(s)$ and $d^{\pi^{*}}(s)$ represent the state occupancy measures under $\beta(a \mid s)$ and $\pi^*(a \mid s)$, respectively. Assuming that the divergence between the policy and $\pi^{*}(s)$

Figures (12)

  • Figure 1: Overview of MoGE. MoGE is composed of two sub-modules: a generator and a one-step world model. Guided by the policy and value function, the generator produces critical states that are under-explored but potentially valuable for policy exploration, while the one-step world model predicts the next state and reward to construct transitions. The resulting exploratory transitions can be mixed with real samples from the buffer to perform policy improvement and evaluation.
  • Figure 2: Training curves on benchmarks. The solid lines depict the mean performance, while the shaded areas represent the confidence intervals over three seeds. The first row corresponds to the training curves on the DeepMind Control Suite, while the second row represents the results on OpenAI Gym.
  • Figure 3: Ablation study curves. All ablation experiments use the high-complexity Humanoid-run task from the DMC Suite.
  • Figure 4: DMC environments.
  • Figure 5: Walker2d-v3
  • ...and 7 more figures

Theorems & Definitions (11)

  • Theorem 1: Steady‑State Occupancy Measurement Alignment Theorem
  • proof
  • Lemma 1: Discounted‑Occupancy Lipschitz Lemma
  • proof
  • Lemma 2: FIFO‑Buffer Proximity Lemma
  • proof
  • Lemma 3: Behaviour‑Mixing Contraction Lemma
  • proof
  • Lemma 4: The $\lambda$–mixture estimator bias
  • proof
  • ...and 1 more