Learning Off-policy with Model-based Intrinsic Motivation For Active Online Exploration

Yibo Wang; Jiang Zhao

TL;DR

This work tackles sample-efficient exploration in continuous-control RL by introducing the ACE planner, a hybrid model-based/off-policy framework. It plans online in a latent state space, guided by a forward-predictive intrinsic reward and a novelty-aware terminal value function, to address model uncertainty and sparse rewards. A multi-term training objective couples a deterministic latent dynamics model with a BYOL-style representation and an exponentially weighted MVE value target, enabling efficient credit assignment and robust off-policy learning. Theoretical performance bounds for the H-step lookahead with intrinsic rewards are complemented by extensive experiments on the DMControl, Adroit, and Meta-World benchmarks, which show strong exploration, competitive asymptotic performance, and notable gains in sparse-reward settings. The approach offers a practical path toward real-world, sample-efficient reinforcement learning with planning-driven exploration and planning-centric representations.
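As a rough illustration of the planning idea summarized above, the sketch below scores one candidate action sequence by rolling a latent dynamics model forward for H steps, summing discounted rewards, and bootstrapping with a terminal value. The names (`lookahead_score`, `toy_dynamics`, `toy_reward`, `toy_value`), the one-dimensional toy model, and the discount of 0.5 are assumptions made for illustration and are not taken from the paper. The `reward` argument stands in for whatever modified (extrinsic plus intrinsic) reward the planner uses.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def lookahead_score(
    dynamics: Callable[[Vector, Vector], Vector],
    reward: Callable[[Vector, Vector], float],
    terminal_value: Callable[[Vector], float],
    z0: Vector,
    actions: Sequence[Vector],
    gamma: float = 0.99,
) -> float:
    """Discounted H-step return in latent space plus a bootstrapped terminal value.

    H is the number of actions in the candidate sequence.
    """
    z = z0
    score = 0.0
    discount = 1.0
    for a in actions:
        score += discount * reward(z, a)  # reward at the current latent state
        z = dynamics(z, a)                # imagined next latent state
        discount *= gamma
    # Bootstrap with the value estimate at the final imagined state.
    return score + discount * terminal_value(z)


# Toy components (hypothetical, chosen so the arithmetic is easy to follow).
def toy_dynamics(z: Vector, a: Vector) -> List[float]:
    return [zi + ai for zi, ai in zip(z, a)]


def toy_reward(z: Vector, a: Vector) -> float:
    return z[0]


def toy_value(z: Vector) -> float:
    return 10.0 * z[0]


score = lookahead_score(
    toy_dynamics, toy_reward, toy_value,
    z0=[0.0], actions=[[1.0], [1.0]], gamma=0.5,
)
print(score)  # 0 + 0.5 * 1 + 0.25 * 20 = 5.5
```

A planner would evaluate many candidate sequences this way and execute the first action of the best one; how candidates are proposed is outside the scope of this sketch.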

Abstract

Recent advances in deep reinforcement learning (RL) have brought notable progress in sample efficiency across both model-based and model-free paradigms. Although prior work has identified and mitigated specific bottlenecks, the agent's exploration ability remains under-emphasized in sample-efficient RL. This paper investigates how to achieve sample-efficient exploration in continuous control tasks. We introduce an RL algorithm that combines a predictive model with off-policy learning, in which an online planner enhanced by a novelty-aware terminal value function is used for sample collection. Leveraging the forward predictive error within a latent state space, we derive an intrinsic reward without incurring any parameter overhead. This reward is closely connected to model uncertainty, allowing the agent to effectively overcome the asymptotic performance gap. In extensive experiments, our method shows competitive or even superior performance compared to prior works, especially in sparse-reward cases.
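To make the intrinsic-reward idea concrete, here is a minimal sketch that assumes the reward is the squared error between the latent state predicted by the dynamics model and the latent state actually observed. The squared-error form, the `scale` coefficient, and the names `intrinsic_reward` and `toy_dynamics` are assumptions for illustration; the abstract only states that the reward is derived from the forward predictive error in latent space and reuses the existing model rather than adding parameters.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def intrinsic_reward(
    dynamics: Callable[[Vector, Vector], Vector],
    z: Vector,
    action: Vector,
    z_next: Vector,
    scale: float = 1.0,
) -> float:
    """Squared forward-prediction error in latent space (assumed form).

    The same dynamics model used for planning is reused here, so no extra
    network is introduced just to compute the bonus.
    """
    z_pred = dynamics(z, action)
    error = sum((p - t) ** 2 for p, t in zip(z_pred, z_next))
    return scale * error


# Toy latent dynamics (hypothetical): the model believes z' = z + a.
def toy_dynamics(z: Vector, a: Vector) -> List[float]:
    return [zi + ai for zi, ai in zip(z, a)]


# Transition the model predicts perfectly: no bonus.
familiar = intrinsic_reward(toy_dynamics, [0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
# Transition the model gets wrong in the second coordinate: positive bonus.
novel = intrinsic_reward(toy_dynamics, [0.0, 0.0], [1.0, 0.0], [1.0, 2.0])
print(familiar, novel)  # 0.0 4.0
```

Transitions the model already predicts well earn little bonus, while poorly predicted ones earn more, which is the sense in which such a reward tracks model uncertainty.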

Paper Structure (27 sections, 5 theorems, 35 equations, 14 figures, 4 tables, 2 algorithms)

Introduction
Preliminaries
Model-based Planning with Value Function
Model-based Value Function Approximation
Curiosity Augmented Latent Space Planning
Planning with Novelty-aware Value Function
Planning with Deterministic Policy Proposal
Learning Off-policy with Planning-centric Representation
Experimental Results
Illustration of Exploration
Continuous Control Problems
Analysis and Ablation Study
Future Work and Limitations
Conclusion
Proofs
...and 12 more sections

Key Result

Lemma 3.1

Let ${\hat{V}^{j}}$ be an approximate value function with error ${\epsilon_{v}:= \max_{s} | \hat{V}^{j}(s) - \widetilde{V}^{\ast}(s) |}$, where ${\widetilde{V}^{\ast}}$ is the optimal value function for the MDP ${\widetilde{\mathcal{M}}}$ with the modified reward function. Let ${\pi^{\hat{V}^{j}(s)}}$ be t

Figures (14)

Figure 1: Exploration illustration. (Top) Exploration traces of agents (episodic length: 300) with state novelty assessed uniformly on the maze as a grid map. The state values, calculated as the mean of predefined velocities in all directions, highlight regions with high model uncertainty (depicted in red). (Bottom) Region coverage curve and cumulative model predictive error on the offline test set. Mean of 5 runs; shaded areas represent 95% confidence intervals (CIs).
Figure 2: Learning progress in four representative tasks. The environment steps constraint for the humanoid-run task is relaxed to showcase the asymptotic performance achievable by each method. Mean of 5 seeds; shaded areas are $95\%$ confidence intervals.
Figure 3: Aggregated sparse reward performance. Normalized goal-reaching counts for Adroit and success rate for Meta-World as a function of environment steps, mean aggregated across individual tasks. Mean of 5 runs; shaded areas are 95% CIs. Full per-task results for Adroit and Meta-World are reported in separate figures.
Figure 4: Relative importance of each design choice. The MVE-based target value is the most effective factor for further improving the final performance. Mean of 5 runs; shaded areas are 95% CIs.
Figure 5: Mean and standard deviation of the normalized estimation bias. We show results for a representative task pair to reflect the subtle difference between dense and sparse reward cases. Mean of 5 runs; shaded areas are 95% CIs. Full ablation results are reported in a separate figure.
...and 9 more figures

Theorems & Definitions (8)

Lemma 3.1
Corollary 3.2
proof
Lemma 1.1
proof
Lemma 1.2
Corollary 1.3
proof
