
An Autonomous Non-monolithic Agent with Multi-mode Exploration based on Options Framework

JaeYoon Kim, Junyu Xuan, Christy Liang, Farookh Hussain

TL;DR

The paper addresses the question of when to explore in reinforcement learning by introducing an autonomous, non-monolithic, multi-mode exploration framework built on the options framework in a hierarchical RL (HRL) setting. It presents a three-level architecture (Top/Middle/Low): the top level selects exploration modes, the middle level provides two exploration modes and one exploitation mode, and the low level executes actions. Key contributions are mode switching that is inherent in the policy objective, entropy-aware mode diversification, guided exploration via reward modification, and online robustness evaluation. On Ant Push and Ant Fall, the method is reported to outperform a non-monolithic baseline and HIRO, supported by metrics on mode usage and transition behavior. The work advances autonomous mode switching without external heuristics and highlights the potential of human- and animal-inspired exploration strategies in HRL, with $R_{final} = R + \alpha_{g^{expl-mode}} \cdot R$ guiding mode selection and $loss_{final} = loss + S_E \cdot loss$ promoting robust policy learning.
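The two formulas quoted in the TL;DR can be written out as a minimal Python sketch. The summary does not define how $\alpha_{g^{expl-mode}}$ or $S_E$ are computed, so the function names, argument names, and numeric values below are illustrative assumptions and not the authors' implementation.

```python
def modified_reward(reward: float, alpha_mode: float) -> float:
    """R_final = R + alpha * R.

    alpha_mode stands in for the mode-dependent coefficient alpha_{g^{expl-mode}};
    how it is chosen is not stated in the summary (assumption: a plain float).
    """
    return reward + alpha_mode * reward


def modified_loss(loss: float, s_e: float) -> float:
    """loss_final = loss + S_E * loss.

    s_e stands in for the scaling term S_E; its definition is not given
    in the summary (assumption: a plain float).
    """
    return loss + s_e * loss


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the arithmetic.
    print(modified_reward(-2.0, 0.5))
    print(modified_loss(4.0, 0.25))
```

Both expressions are multiplicative rescalings, $(1 + \alpha) \cdot R$ and $(1 + S_E) \cdot loss$, so a coefficient of zero leaves the original reward or loss unchanged.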

Abstract

Most exploration research in reinforcement learning (RL) has focused on 'the way of exploration', that is, 'how to explore'. The other question, 'when to explore', has not been a main focus of RL exploration research. In the usual monolithic exploration behaviour, the issue of 'when' binds an exploratory action to an exploitational action of an agent. Recently, non-monolithic exploration research has emerged that examines the mode-switching exploration behaviour of humans and animals. The ultimate purpose of our research is to enable an agent to decide autonomously when to explore or exploit. We describe initial research on autonomous multi-mode exploration with non-monolithic behaviour in an options framework. Comparative experimental results show that our method achieves higher performance than the existing non-monolithic exploration method.

Paper Structure (23 sections, 9 equations, 5 figures, 1 table, 1 algorithm)


Figures (5)

  • Figure 1: An example of noise-based monolithic exploration (left) and non-monolithic exploration (right). The final action, which is a scalar in this example for ease of explanation, is the action of the agent at each step and is drawn as a solid circle. The solid line denotes exploitation, i.e. the original action of the behaviour policy. In noise-based monolithic exploration, the solid circle is a final action that combines the original action of the behaviour policy with a sampled bounded noise at each step. In non-monolithic exploration, by contrast, the solid circle is determined by the mode at each step: either exploitation, which is the original action of the behaviour policy, or exploration, which is random noise or a policy.
  • Figure 2: The architecture of our suggested model (right) compared with that of the reference paper using homeostasis [63] (left)
  • Figure 3: Counts of exploration-mode and exploitation steps, together with the reward and success rate of the higher-level policy, for our model, Ref:Uniform random, Ref:PPO and HIRO in Ant Push
  • Figure 4: Counts of exploration-mode and exploitation steps, together with the reward and success rate of the higher-level policy, for our model, Ref:Uniform random, Ref:PPO and HIRO in Ant Fall
  • Figure 5: Three types of ablation study against our normal model in Ant Push
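
The contrast drawn in the Figure 1 caption can be illustrated with a small Python sketch for a scalar action. The function names, the uniform noise distribution, and the action bounds are assumptions made for illustration; the caption only states that monolithic exploration adds bounded noise at every step, while non-monolithic exploration picks either the policy action or an exploratory action according to the current mode.

```python
import random


def monolithic_action(policy_action: float, noise_bound: float,
                      rng: random.Random) -> float:
    """Noise-based monolithic exploration: every step adds bounded noise
    to the behaviour policy's action (uniform noise is an assumption)."""
    return policy_action + rng.uniform(-noise_bound, noise_bound)


def non_monolithic_action(policy_action: float, mode: str,
                          action_low: float, action_high: float,
                          rng: random.Random) -> float:
    """Non-monolithic exploration: the mode decides the final action.

    'exploit' returns the policy action unchanged; 'explore' returns a
    random action (a separate exploration policy could be used instead).
    """
    if mode == "exploit":
        return policy_action
    if mode == "explore":
        return rng.uniform(action_low, action_high)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical mode schedule; in the paper the mode is chosen by the agent.
    for mode in ["exploit", "explore", "exploit"]:
        print(mode, non_monolithic_action(0.5, mode, -1.0, 1.0, rng))
```

In the monolithic case no step is purely exploitative, whereas in the non-monolithic case an 'exploit' step reproduces the policy action exactly.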