Look Before Leap: Look-Ahead Planning with Uncertainty in Reinforcement Learning

Yongshuai Liu, Xin Liu

TL;DR

The paper tackles the limitations of model-based RL caused by model bias and poor multi-step predictions in data-scarce regions. It introduces a two-phase framework that combines uncertainty-aware, $k$-step lookahead planning with an uncertainty-driven exploratory policy, supported by a theoretical bound that reveals a trade-off between model uncertainty and value-function approximation error. The method uses a Variational Bayes dynamic model with dropout to generate fixed-horizon fantasy trajectories and an RND-based intrinsic reward to guide exploration, improving both forward dynamics and policy performance. Empirical results on MuJoCo and Atari demonstrate superior sample efficiency and robustness across continuous and discrete, dense and sparse reward tasks, outperforming state-of-the-art baselines. The approach is scalable and broadly applicable to tasks with varying state/action spaces and reward structures.
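As a rough illustration of the RND-style intrinsic reward mentioned above, the following minimal sketch computes an exploration bonus as the error of a trained predictor against a fixed, randomly initialized target network. Everything here is an assumption made for illustration rather than the authors' implementation: the class name `RND`, the linear predictor, the `tanh` target, the feature size, and the learning rate are all hypothetical choices.

```python
import math
import random


class RND:
    """Toy Random Network Distillation bonus (illustrative assumption, not the paper's code)."""

    def __init__(self, in_dim, feat_dim=4, lr=0.1, seed=0):
        rng = random.Random(seed)
        # Fixed random target network: never trained.
        self.target = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
                       for _ in range(feat_dim)]
        # Linear predictor, trained to imitate the target on visited states.
        self.pred = [[0.0] * in_dim for _ in range(feat_dim)]
        self.lr = lr

    def _target(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                for row in self.target]

    def _predict(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.pred]

    def intrinsic_reward(self, x):
        # Squared prediction error: large for rarely visited states.
        return sum((p - t) ** 2
                   for p, t in zip(self._predict(x), self._target(x)))

    def update(self, x):
        # One SGD step on the squared error for a visited state.
        errs = [p - t for p, t in zip(self._predict(x), self._target(x))]
        for row, e in zip(self.pred, errs):
            for i, xi in enumerate(x):
                row[i] -= self.lr * e * xi


rnd = RND(in_dim=3)
state = [0.5, -0.2, 1.0]
bonus_before = rnd.intrinsic_reward(state)
for _ in range(50):
    rnd.update(state)
bonus_after = rnd.intrinsic_reward(state)
```

In this sketch the bonus for a state shrinks as the predictor is trained on it, so repeatedly visited states become less attractive to an exploratory policy than states the predictor has not yet fit.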

Abstract

Model-based reinforcement learning (MBRL) has demonstrated superior sample efficiency compared to model-free reinforcement learning (MFRL). However, the presence of inaccurate models can introduce biases during policy learning, resulting in misleading trajectories. The challenge lies in obtaining accurate models due to limited diverse training data, particularly in regions with limited visits (uncertain regions). Existing approaches passively quantify uncertainty after sample generation, failing to actively collect uncertain samples that could enhance state coverage and improve model accuracy. Moreover, MBRL often faces difficulties in making accurate multi-step predictions, thereby impacting overall performance. To address these limitations, we propose a novel framework for uncertainty-aware policy optimization with model-based exploratory planning. In the model-based planning phase, we introduce an uncertainty-aware k-step lookahead planning approach to guide action selection at each step. This process involves a trade-off analysis between model uncertainty and value function approximation error, effectively enhancing policy performance. In the policy optimization phase, we leverage an uncertainty-driven exploratory policy to actively collect diverse training samples, resulting in improved model accuracy and overall performance of the RL agent. Our approach offers flexibility and applicability to tasks with varying state/action spaces and reward structures. We validate its effectiveness through experiments on challenging robotic manipulation tasks and Atari games, surpassing state-of-the-art methods with fewer interactions, thereby leading to significant performance improvements.
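The planning phase described in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm: the 1-D toy dynamics in `stochastic_model` stand in for one dropout forward pass of a learned model, `value_fn` stands in for the learned value function, candidate action sequences come from simple random shooting, and the penalty weight `beta` and the other hyperparameters are hypothetical.

```python
import random
import statistics


def stochastic_model(state, action, rng):
    """Toy stand-in for one stochastic (dropout-like) forward pass.

    Noise grows away from the origin to mimic rarely visited regions.
    """
    noise = rng.gauss(0.0, 0.05 * (1.0 + abs(state)))
    next_state = state + action + noise
    reward = -abs(next_state - 1.0)  # hypothetical goal at 1.0
    return next_state, reward


def value_fn(state):
    """Hypothetical approximate value used at the end of the horizon."""
    return -abs(state - 1.0)


def lookahead_score(state, actions, rng, gamma=0.99, n_passes=8, beta=1.0):
    """Score a k-step action sequence: model return + terminal value - uncertainty."""
    total, uncertainty, s = 0.0, 0.0, state
    for t, a in enumerate(actions):
        samples = [stochastic_model(s, a, rng) for _ in range(n_passes)]
        next_states = [ns for ns, _ in samples]
        rewards = [r for _, r in samples]
        # Disagreement across stochastic passes approximates model uncertainty.
        uncertainty += statistics.pstdev(next_states)
        total += gamma ** t * statistics.fmean(rewards)
        s = statistics.fmean(next_states)
    total += gamma ** len(actions) * value_fn(s)
    return total - beta * uncertainty


def plan(state, k=3, n_candidates=64, seed=0):
    """Random-shooting planner: return the first action of the best k-step sequence."""
    rng = random.Random(seed)
    best_action, best_score = None, float("-inf")
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        score = lookahead_score(state, actions, rng)
        if score > best_score:
            best_action, best_score = actions[0], score
    return best_action


first_action = plan(0.0)
```

The horizon `k` reflects the trade-off the abstract refers to: a longer horizon relies less on the approximate value function but accumulates more model uncertainty, which the penalty term makes explicit in this sketch.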

Paper Structure

This paper contains 20 sections, 1 theorem, 9 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

(k-step lookahead policy) Suppose $f_\theta$ is an uncertainty-aware dynamics model with uncertainty variation for all states bounded by $\epsilon_f$. Let $V_\sigma$ be an approximate value function for extrinsic rewards satisfying $\max\limits_s|V_\sigma^{*}(s)-V_\sigma(s)| \le \epsilon_v$, where $

Figures (6)

  • Figure 1: The illustration of the look before leap framework.
  • Figure 2: The performance on Walker2D, Hand Manipulate Block, and BeamRider under different model-based settings.
  • Figure 3: The dynamic model prediction errors in Montezuma’s Revenge (MR)
  • Figure 4: The uncertainty-exploration intrinsic reward (e.g., RND) in MR
  • Figure 5: The mean number of rooms found in MR
  • ...and 1 more figure

Theorems & Definitions (1)

  • Theorem 1