Guided Exploration in Reinforcement Learning via Monte Carlo Critic Optimization

Igor Kuznetsov

TL;DR

This work tackles the limitations of random exploration in deep deterministic off-policy RL by introducing guided exploration through an Exploratory Module that leverages an ensemble of Monte Carlo critics to quantify uncertainty and generate directed action corrections. A novel MOCCO algorithm combines this exploration mechanism with a Monte Carlo-augmented critic loss, using the MC mean to temper Q-value overestimation while an on-policy exploratory correction guides action selection. Empirical results on the DMControl suite show that guided exploration improves over traditional noise-based methods, and MOCCO consistently outperforms major off-policy baselines (DDPG, TD3, SAC, and TD3-RND) with robust performance across tasks and modest hyperparameter sensitivity. The approach offers a practical, learnable mechanism for dynamic exploration that can enhance sample efficiency and performance in continuous control tasks, with potential extensions to model-based or memory-augmented exploration frameworks.

Abstract

The class of deep deterministic off-policy algorithms is effectively applied to solve challenging continuous control problems. Current approaches commonly utilize random noise as an exploration method, which has several drawbacks, including the need for manual adjustment for a given task and the absence of exploratory calibration during the training process. We address these challenges by proposing a novel guided exploration method that uses an ensemble of Monte Carlo Critics for calculating exploratory action correction. The proposed method enhances the traditional exploration scheme by dynamically adjusting exploration. Subsequently, we present a novel algorithm that leverages the proposed exploratory module for both policy and critic modification. The presented algorithm demonstrates superior performance compared to modern reinforcement learning algorithms across a variety of problems in the DMControl suite.
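
The guided exploration idea described above can be illustrated with a small sketch. It is not the paper's implementation: it assumes that uncertainty is measured as the standard deviation of the ensemble's predictions, that the exploratory correction follows the gradient of that uncertainty with respect to the action, and that the gradient is approximated by finite differences. The names `ensemble_uncertainty`, `exploratory_correction`, `guided_action`, and the `scale` parameter are hypothetical.

```python
import numpy as np

def ensemble_uncertainty(critics, state, action):
    """Disagreement of the ensemble: std of the critics' value predictions.
    (Assumption: std is used as the uncertainty measure.)"""
    preds = np.array([q(state, action) for q in critics])
    return preds.std()

def exploratory_correction(critics, state, action, scale=0.1, eps=1e-4):
    """Direction in action space that increases ensemble uncertainty.
    Central finite differences stand in for an autodiff gradient (assumption)."""
    action = np.asarray(action, dtype=float)
    grad = np.zeros_like(action)
    for i in range(action.size):
        step = np.zeros_like(action)
        step[i] = eps
        up = ensemble_uncertainty(critics, state, action + step)
        down = ensemble_uncertainty(critics, state, action - step)
        grad[i] = (up - down) / (2.0 * eps)
    norm = np.linalg.norm(grad)
    if norm < 1e-12:
        # No preferred direction: the ensemble is equally (un)certain nearby.
        return np.zeros_like(grad)
    # Normalize, then rescale so the correction has a fixed magnitude.
    return scale * grad / norm

def guided_action(policy_action, critics, state, scale=0.1, low=-1.0, high=1.0):
    """Deterministic policy action plus the directed correction, clipped to bounds."""
    correction = exploratory_correction(critics, state, policy_action, scale)
    return np.clip(np.asarray(policy_action, dtype=float) + correction, low, high)

# Toy usage with two hand-written linear "critics" (purely illustrative).
toy_critics = [
    lambda s, a: float(a[0]),
    lambda s, a: float(-a[0]),
]
action = guided_action(np.array([0.5, 0.0]), toy_critics, state=None)
```

In this toy setup the two critics disagree more as the first action coordinate grows, so the correction pushes the action along that coordinate. Unlike additive random noise, the size and direction of the correction depend on the learned critics, so they change as training progresses.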

Paper Structure

This paper contains 12 sections, 15 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Scheme of the guided exploration model. Policy gradients $\nabla_\theta$ are directed by the final reward objective $J$ and corrected by an intrinsic objective $I$ that facilitates directed exploration.
  • Figure 2: Preliminary motivation experiment. The performance of the original TD3 algorithm compared with a variant without exploration noise.
  • Figure 3: Visualization of uncertainty estimation (left) and critic prediction (right) at the same state on the 2D action plane. Environment: point_mass-easy.
  • Figure 4: Visualization of exploratory action scaling.
  • Figure 5: An illustration of Q-value overestimation. The true value of the $Q$-function (Q-true) lies between the overestimated critic prediction (Q) and the underestimated Monte Carlo prediction (Q-MC).
  • ...and 4 more figures
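
The relationship in the Figure 5 caption, together with the TL;DR's description of a Monte Carlo-augmented critic loss, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's loss: it assumes the Monte Carlo estimate is a discounted return computed from stored rewards, and that the critic loss adds a weighted squared-error term pulling the critic toward the MC mean. The function names and the weight `beta` are hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return for every step of one episode:
    G_t = r_t + gamma * G_{t+1}, computed backwards from the last step."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mc_regularized_critic_loss(q_pred, td_target, q_mc_mean, beta=0.1):
    """TD loss plus a penalty on the gap to the MC ensemble mean.
    (Assumed form: the MC term pulls an overestimating critic downward.)"""
    q_pred = np.asarray(q_pred, dtype=float)
    td_term = np.mean((q_pred - np.asarray(td_target, dtype=float)) ** 2)
    mc_term = np.mean((q_pred - np.asarray(q_mc_mean, dtype=float)) ** 2)
    return td_term + beta * mc_term

# Toy usage: a three-step episode with unit rewards.
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
loss = mc_regularized_critic_loss([2.0], [3.0], [1.0], beta=0.5)
```

With `beta = 0` the sketch reduces to an ordinary TD loss; larger values weight the Monte Carlo term more heavily, which trades overestimation against the underestimation of Q-MC noted in the caption.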