Bounded Exploration with World Model Uncertainty in Soft Actor-Critic Reinforcement Learning Algorithm

Ting Qiao, Henry Williams, David Valencia, Bruce MacDonald

TL;DR

This work tackles data inefficiency in reinforcement learning by proposing bounded exploration, which blends soft policy entropy with world-model uncertainty as intrinsic motivation without changing the reward function. The method samples multiple action candidates from a SAC policy, uses an ensemble of world models to estimate epistemic uncertainty, and selects the candidate with high uncertainty via a Gibbs-based ranking, then aligns with the SAC mean for stability. Empirical results on MuJoCo show improved data efficiency and highest scores in 6 of 8 tasks for the model-free SAC with bounded exploration, with mixed gains in model-based extensions. The study highlights practical benefits for real-world exploration under bounded criteria while noting limitations in complex environments and suggesting avenues for broader evaluation and theoretical grounding.

Abstract

One of the bottlenecks preventing Deep Reinforcement Learning (DRL) algorithms from being applied in the real world is how to explore the environment and collect informative transitions efficiently. The present paper describes bounded exploration, a novel exploration method that integrates both 'soft' and intrinsic motivation exploration. Bounded exploration notably improved the Soft Actor-Critic algorithm's performance and its model-based extension's convergence speed. It achieved the highest score in 6 out of 8 experiments. Bounded exploration presents an alternative way to introduce intrinsic motivation into exploration when the original reward function has a strict meaning.

Paper Structure

This paper contains 11 sections, 3 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: An oracle agent can follow the optimal path (green line), which always yields the highest rewards. In reality, an agent must find this path by trial and error, presumably within the region marked by smiley faces between the red dotted lines. However, the agent may instead pursue exploration bonuses ($\otimes$), producing the purple trajectory, which can lie far from the optimal path.
  • Figure 2: Soft Actor-Critic Updating Map ($\mathbf{L}$ is a loss function)
  • Figure 3: For each current state: Step 1 (SAC Policy): the soft policy parameterizes Normal distributions from which $\mathbf{N}$ action candidates (e.g. $\mathbf{N}=100$) are sampled. Step 2 (Uncertainty Estimation): the action candidates and the state are fed forward through an ensemble of world models (e.g. $\Omega_{\mathbf{M}=5}$) to compute uncertainty. Step 3 (Bounded Exploration): the action causing the highest world-model uncertainty is selected, so that one of the $\mathbf{N}=100$ candidates is executed.
  • Figure 4: Sampling actions from a multivariate stochastic policy (e.g. Hopper-v4): $\mathbf{N}=4$ samples, shown in different colours, were drawn from the distribution.
  • Figure 5: MuJoCo environments.
  • ...and 2 more figures
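
The three steps in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a diagonal Normal policy with tanh squashing, toy linear world models standing in for learned networks, and ensemble disagreement (variance of predicted next states across models) as the uncertainty measure. The function names `sample_candidates`, `ensemble_uncertainty` and `bounded_exploration_action` are hypothetical. The final step uses a plain argmax, following the Figure 3 caption, rather than the Gibbs-based ranking mentioned in the TL;DR.

```python
import numpy as np


def sample_candidates(mean, std, n, rng):
    """Step 1: draw n action candidates from a diagonal Normal policy."""
    raw = rng.normal(loc=mean, scale=std, size=(n, mean.shape[0]))
    # Assumption: tanh squashing into [-1, 1], as is common for SAC policies.
    return np.tanh(raw)


def ensemble_uncertainty(state, actions, weights):
    """Step 2: score each candidate by disagreement across M world models.

    Assumption: each world model is a toy linear map. `weights` has shape
    (M, state_dim + action_dim, state_dim).
    """
    n = actions.shape[0]
    inputs = np.concatenate([np.tile(state, (n, 1)), actions], axis=1)  # (N, S+A)
    preds = np.einsum("ni,mio->mno", inputs, weights)  # (M, N, S) next-state predictions
    # Variance across the ensemble axis, summed over state dimensions.
    return preds.var(axis=0).sum(axis=1)  # (N,)


def bounded_exploration_action(state, mean, std, weights, n=100, rng=None):
    """Step 3: execute the candidate with the highest ensemble uncertainty."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = sample_candidates(mean, std, n, rng)
    scores = ensemble_uncertainty(state, candidates, weights)
    best = int(np.argmax(scores))
    return candidates[best], scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state_dim, action_dim, n_models = 3, 2, 5  # M = 5 models, as in the caption
    weights = rng.normal(size=(n_models, state_dim + action_dim, state_dim))
    state = rng.normal(size=state_dim)
    action, scores = bounded_exploration_action(
        state, np.zeros(action_dim), np.ones(action_dim), weights, n=100, rng=rng
    )
    print("selected action:", action)
```

Because every candidate is drawn from the policy's own distribution, the selected action stays within what the soft policy considers plausible, which is one way to read the "bounded" in the method's name. The reward function is never modified in this sketch.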