Bounded Exploration with World Model Uncertainty in Soft Actor-Critic Reinforcement Learning Algorithm
Ting Qiao, Henry Williams, David Valencia, Bruce MacDonald
TL;DR
This work tackles data inefficiency in reinforcement learning by proposing bounded exploration, which blends soft policy entropy with world-model uncertainty as intrinsic motivation without changing the reward function. The method samples multiple action candidates from a SAC policy, uses an ensemble of world models to estimate epistemic uncertainty, and selects the candidate with high uncertainty via a Gibbs-based ranking, then aligns with the SAC mean for stability. Empirical results on MuJoCo show improved data efficiency and highest scores in 6 of 8 tasks for the model-free SAC with bounded exploration, with mixed gains in model-based extensions. The study highlights practical benefits for real-world exploration under bounded criteria while noting limitations in complex environments and suggesting avenues for broader evaluation and theoretical grounding.
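The selection step summarized above can be illustrated with a minimal sketch. It is one possible reading of the summary, not the authors' implementation. The Gaussian policy with tanh squashing, the variance-based disagreement measure, the `temperature` parameter, and the names `select_action` and `make_linear_model` are all assumptions introduced for illustration. The final alignment with the SAC mean is omitted because the summary does not specify how it is done.

```python
import numpy as np

def make_linear_model(seed, state_dim, action_dim):
    """Hypothetical stand-in for one learned world model: a fixed random linear map."""
    w = np.random.default_rng(seed).standard_normal((state_dim, state_dim + action_dim))
    return lambda s, a: w @ np.concatenate([s, a])

def select_action(mean, std, ensemble, state, n_candidates=8, temperature=1.0, rng=None):
    """Sample candidate actions and pick one, favouring high ensemble disagreement."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw candidates from an assumed Gaussian policy and squash them with tanh.
    raw = mean + std * rng.standard_normal((n_candidates, mean.shape[0]))
    candidates = np.tanh(raw)
    # Predicted next state for every (model, candidate) pair: shape (M, N, state_dim).
    preds = np.array([[model(state, a) for a in candidates] for model in ensemble])
    # Assumed epistemic-uncertainty proxy: variance across models, averaged over state dims.
    uncertainty = preds.var(axis=0).mean(axis=-1)
    # Gibbs (Boltzmann) distribution over candidates, computed in a numerically stable way.
    logits = uncertainty / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sample one candidate; higher disagreement gives higher selection probability.
    idx = rng.choice(n_candidates, p=probs)
    return candidates[idx], uncertainty, probs
```

Because every candidate is drawn from the policy itself, exploration stays within the region the policy already considers plausible, and the reward function is never modified. A lower `temperature` pushes the choice toward the most uncertain candidate, while a higher value approaches uniform sampling among the candidates.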
Abstract
One of the bottlenecks preventing Deep Reinforcement Learning (DRL) algorithms from being applied in the real world is how to explore the environment and collect informative transitions efficiently. This paper describes bounded exploration, a novel exploration method that integrates both 'soft' and intrinsic-motivation exploration. Bounded exploration notably improved the performance of the Soft Actor-Critic algorithm and the convergence speed of its model-based extension. It achieved the highest score in 6 out of 8 experiments. Bounded exploration offers an alternative way to introduce intrinsic motivation into exploration when the original reward function has a strict meaning.
