An Autonomous Non-monolithic Agent with Multi-mode Exploration based on Options Framework
JaeYoon Kim, Junyu Xuan, Christy Liang, Farookh Hussain
TL;DR
The paper addresses the question of when to explore in reinforcement learning. It introduces an autonomous, non-monolithic, multi-mode exploration method built on the options framework in a hierarchical RL (HRL) setting. The architecture has three levels: the top level selects the mode, the middle level provides two exploration modes and one exploitation mode, and the low level executes actions. The key contributions are mode switching that is inherent in the policy objective, entropy-aware mode diversification, guided exploration via reward modification, and online robustness evaluation. Mode selection is guided by the modified reward $R_{final} = R + \alpha_{g^{expl-mode}} \cdot R$, and robust policy learning is promoted by the modified loss $loss_{final} = loss + S_E \cdot loss$. On Ant Push and Ant Fall, the method outperforms both a non-monolithic baseline and HIRO, as evidenced by metrics on mode usage and transition behavior. The work achieves autonomous mode switching without external heuristics and highlights the potential of human- and animal-inspired exploration strategies in HRL.
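The two formulas in this summary can be read as simple multiplicative adjustments. The sketch below is illustrative only and is not the authors' implementation: the function names and the numeric values are assumptions, and how $\alpha_{g^{expl-mode}}$ and $S_E$ are actually computed is not specified in this summary.

```python
def modified_reward(reward: float, alpha_mode: float) -> float:
    """R_final = R + alpha * R, where alpha depends on the selected mode.

    `alpha_mode` stands in for alpha_{g^{expl-mode}}; its value here is
    a placeholder, not a value taken from the paper.
    """
    return reward + alpha_mode * reward


def modified_loss(loss: float, s_e: float) -> float:
    """loss_final = loss + S_E * loss.

    `s_e` stands in for S_E; how it is derived is an open assumption here.
    """
    return loss + s_e * loss


# Hypothetical values chosen only to show the arithmetic.
r_final = modified_reward(reward=2.0, alpha_mode=0.5)
loss_final = modified_loss(loss=4.0, s_e=0.25)
print(r_final, loss_final)
```

Both adjustments are equivalent to scaling by $(1 + \alpha)$ and $(1 + S_E)$ respectively, so a coefficient of zero leaves the original reward or loss unchanged.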
Abstract
Most exploration research in reinforcement learning (RL) has focused on the way of exploration, that is, 'how to explore'. The complementary question, 'when to explore', has not been a main focus. In the usual monolithic exploration behaviour, the question of 'when' is left unaddressed because an agent's exploratory action is bound to its exploitative action. Recently, non-monolithic exploration research has emerged to examine the mode-switching exploration behaviour of humans and animals. The ultimate purpose of our research is to enable an agent to decide autonomously when to explore and when to exploit. We describe initial research on autonomous multi-mode exploration with non-monolithic behaviour in an options framework. Comparative experiments show that our method achieves higher performance than the existing non-monolithic exploration method.
