MENTOR: Guiding Hierarchical Reinforcement Learning with Human Feedback and Dynamic Distance Constraint

Xinglin Zhou; Yifu Yuan; Shaofu Yang; Jianye Hao

TL;DR

This work proposes MENTOR, a general hierarchical reinforcement learning framework that incorporates human feedback and dynamic distance constraints. MENTOR acts as a "mentor" for high-level policy learning, and its Dynamic Distance Constraint (DDC) mechanism dynamically adjusts the space of subgoals available for selection, so that the generated subgoals match the low-level policy's learning progress, from easy to hard.

Abstract

Hierarchical reinforcement learning (HRL) is a promising approach for intelligent agents facing complex tasks with sparse rewards: it uses a hierarchical framework that divides a task into subgoals and completes them sequentially. However, current methods struggle to find subgoals suitable for a stable learning process. Without additional guidance, it is impractical to rely solely on exploration or heuristic methods to determine subgoals in a large goal space. To address this issue, we propose MENTOR, a general hierarchical reinforcement learning framework incorporating human feedback and dynamic distance constraints. MENTOR acts as a "mentor", incorporating human feedback into high-level policy learning to find better subgoals. For the low-level policy, MENTOR uses a dual-policy design that decouples exploration from exploitation to stabilize training. Furthermore, although humans can easily break tasks down into subgoals that point learning in the right direction, subgoals that are too difficult or too easy can still hinder downstream learning efficiency. We therefore propose the Dynamic Distance Constraint (DDC) mechanism, which dynamically adjusts the space of subgoals available for selection, so that MENTOR generates subgoals matching the low-level policy's learning progress, from easy to hard. Extensive experiments demonstrate that MENTOR achieves significant improvement on complex tasks with sparse rewards while using only a small amount of human feedback.

Paper Structure

This paper contains 26 sections, 17 equations, 21 figures, 3 tables, 3 algorithms.

Introduction
Related work
Hierarchical reinforcement learning
Reinforcement Learning from Human Feedback
Our Contribution
Preliminary
Problem Setting
Hindsight Relabelling
Curiosity-driven Exploration
Methodology
RLHF and Dynamic Distance Constraint in High-level
Subgoal generation using RLHF
Dynamic Distance Constraint for Subgoal Difficulty Adjustment
Exploration-Exploitation Decoupling in Low-level Policy
MENTOR Process
...and 11 more sections

Figures (21)

Figure 1: (a) The high-level policy selects subgoals with DDC (shades of green), and a human guides it by comparing these subgoals. (b) The low level decouples exploration and exploitation through two policies: one explores the environment and the other learns from the experience gathered during exploration. (c) Diagrammatic representation of the MENTOR framework.
Figure 2: As the low-level capability improves, the DDC progressively relaxes, allowing the high-level to propose increasingly challenging subgoals.
Figure 4: Success rates of MENTOR compared with baseline methods across different benchmarks, averaged over five runs with different random seeds. The shaded areas surrounding each curve represent the standard deviation. In the Four Rooms domain, the performance curve is non-smooth because the positions of the starting point and the goal are fixed, so the success rate can jump abruptly from 0% to 100%, producing a curve with large variance. Curves that are not visible in the graph indicate a zero success rate throughout the trials.
Figure 5: Impacts of Distance Constraints on success rate in FetchPickAndPlace and FetchPush domains. Since the high-level policy requires data to be collected before updating, a segment is missing from the distance curve.
Figure 6: Effects of the balancing coefficient on the environment goal success rate in the FetchPickAndPlace domain, examined on five random seeds. In the first three graphs, the dashed lines represent the average success rate with auto-set $\alpha$ in the worst case, where $\Delta k$ is 0.02. Here $\Delta k$ denotes the adjustment applied to the parameter $k$: $k$ is increased when a subgoal is completed successfully and decreased on failure.
...and 16 more figures
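
The captions of Figure 2 and Figure 6 describe the distance constraint being relaxed as the low level improves, with $k$ raised on subgoal success and lowered on failure by a step $\Delta k$. The following is a minimal sketch of that update rule only, not the authors' implementation. The class name, the bounds `k_min` and `k_max`, the initial value of `k`, and the `admits` check against a scalar distance are all assumptions introduced for illustration; the only value taken from the text is the step 0.02 mentioned in the Figure 6 caption.

```python
from dataclasses import dataclass


@dataclass
class DistanceConstraint:
    """Illustrative threshold k on how far a subgoal may lie from the current state."""

    k: float = 0.1         # assumed initial threshold (hypothetical value)
    delta_k: float = 0.02  # step size; 0.02 is the value named in the Figure 6 caption
    k_min: float = 0.0     # assumed lower bound (hypothetical)
    k_max: float = 1.0     # assumed upper bound (hypothetical)

    def update(self, subgoal_reached: bool) -> float:
        """Relax the constraint after a success, tighten it after a failure."""
        step = self.delta_k if subgoal_reached else -self.delta_k
        # Clamp so the threshold stays inside the assumed bounds.
        self.k = min(self.k_max, max(self.k_min, self.k + step))
        return self.k

    def admits(self, distance: float) -> bool:
        """A candidate subgoal is allowed if its estimated distance is within k."""
        return distance <= self.k


if __name__ == "__main__":
    constraint = DistanceConstraint()
    # Hypothetical sequence of low-level outcomes: mostly successes, one failure.
    for reached in [True, True, False, True]:
        constraint.update(reached)
    print(f"threshold after four episodes: {constraint.k:.2f}")
    print("subgoal at distance 0.12 allowed:", constraint.admits(0.12))
```

In this sketch a run of successes gradually widens the set of admissible subgoals, which mirrors the easy-to-hard progression the abstract describes; how distance is actually estimated in MENTOR is not specified in this excerpt.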
