HG2P: Hippocampus-inspired High-reward Graph and Model-Free Q-Gradient Penalty for Path Planning and Motion Control

Haoran Wang, Yaoru Sun, Zeshen Tang, Haibo Shi, Chenyuan Jiao

TL;DR

This paper proposes a hippocampus-striatum-like dual-controller hypothesis, a high-return sampling strategy for constructing memory graphs that improves sample efficiency, and a model-free lower-level Q-function gradient penalty that resolves the model dependency issues present in prior work.

Abstract

Goal-conditioned hierarchical reinforcement learning (HRL) decomposes complex reaching tasks into a sequence of simple subgoal-conditioned tasks, showing significant promise for addressing long-horizon planning in large-scale environments. This paper links goal-conditioned HRL with graph-based planning to brain mechanisms, proposing a hippocampus-striatum-like dual-controller hypothesis. Inspired by the brain mechanisms of organisms (i.e., the high-reward preferences observed in hippocampal replay) and instance-based theory, we propose a high-return sampling strategy for constructing memory graphs, improving sample efficiency. Additionally, we derive a model-free lower-level Q-function gradient penalty to resolve the model dependency issues present in prior work, improving the generalization of Lipschitz constraints in applications. Finally, we integrate these two extensions, High-reward Graph and model-free Gradient Penalty (HG2P), into the state-of-the-art framework ACLG, yielding a novel goal-conditioned HRL framework, HG2P+ACLG. Experiments show that our method outperforms state-of-the-art goal-conditioned HRL algorithms on a variety of long-horizon navigation tasks and robotic manipulation tasks.

Paper Structure

This paper contains 29 sections, 3 theorems, 25 equations, 15 figures.

Key Result

Proposition 1

Given an MDP with deterministic environment dynamics, suppose the deterministic policy $\pi(a_t, s_t)$ and the reward function $r(s_t, a_t)$ are differentiable over their respective input spaces. This differentiability property is satisfied by the usual neural network-based approximators. [...] where $N_s$ is the dimension of the states and $\gamma$ is the discount factor.
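
The statement above concerns Lipschitzness of the Q-function, and the abstract mentions a gradient penalty on the lower-level Q-function. The paper's exact penalty is not reproduced in this overview, so the following is only a generic sketch of a gradient penalty that needs no dynamics model: it estimates the gradient of a critic with respect to one of its inputs and applies a squared hinge when the gradient norm exceeds a chosen bound. The names `input_gradient`, `gradient_penalty`, `toy_q`, `lipschitz_bound`, the choice of the subgoal as the constrained input, and the finite-difference estimator are all assumptions for illustration; a practical implementation would normally use automatic differentiation.

```python
import math

def input_gradient(q_fn, x, eps=1e-5):
    """Central finite-difference estimate of dQ/dx, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        hi = tuple(v + eps if j == i else v for j, v in enumerate(x))
        lo = tuple(v - eps if j == i else v for j, v in enumerate(x))
        grad.append((q_fn(hi) - q_fn(lo)) / (2 * eps))
    return grad

def gradient_penalty(q_fn, inputs, lipschitz_bound):
    """Mean squared hinge on the gradient norm (assumed penalty form)."""
    total = 0.0
    for x in inputs:
        norm = math.sqrt(sum(g * g for g in input_gradient(q_fn, x)))
        total += max(0.0, norm - lipschitz_bound) ** 2
    return total / len(inputs)

# Hypothetical lower-level critic: the state is held fixed and the value is a
# scaled negative distance to the subgoal, so its gradient norm is 3 everywhere
# away from the state itself.
STATE = (0.0, 0.0)

def toy_q(subgoal):
    return -3.0 * math.dist(STATE, subgoal)

subgoals = [(1.0, 2.0), (-0.5, 4.0), (3.0, 3.0)]
penalty_tight = gradient_penalty(toy_q, subgoals, lipschitz_bound=1.0)
penalty_loose = gradient_penalty(toy_q, subgoals, lipschitz_bound=5.0)
```

With a bound below the critic's true slope the penalty is positive, and with a bound above it the penalty vanishes, which is the behavior a Lipschitz-style regularizer is meant to have.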

Figures (15)

  • Figure 1: Schematic of the proposed hippocampus-striatum-like dual-controller hypothesis. Chersi et al. [chersi2015cognitive] proposed a minimal cognitive architecture for spatial navigation, comprising two principal mechanisms: (i) the hippocampus, which encodes environmental locations to support goal-directed decision making; and (ii) the striatum, which learns stimulus-response associations. We build on this architecture by elaborating the functional roles of the hippocampus and striatum modules, incorporating hippocampal replay and the involvement of the primary motor cortex (M1): (a) The hippocampus models place fields, similar to landmarks, based on place cells and grid cells. During training, hippocampal replay, particularly reverse replay, prioritizes memories associated with high rewards [singer2009rewarded, ambrose2016reverse, murty2017selectivity, michon2019post, michon2021single, elliott2020neural], strong emotional responses, and novel experiences [takeuchi2016locus]; (b) The striatum functions as the higher-level policy in HRL. Furthermore, we detail the output pathway from the prefrontal cortex: motor signals are modulated by the primary motor cortex (M1), functioning like the lower-level policy in HRL, and are then transmitted along the spinal cord to the end effectors for motor control.
  • Figure 2: Overview of the proposed framework. The right part illustrates the architecture of goal-conditioned hierarchical RL, where the higher-level policy observes the environment and generates a high-level action (i.e., subgoal), while the lower-level policy works to reach the assigned subgoal. The left part details the graph-building and planning process: coverage- and novelty-based landmarks are selected to construct a topological map, from which the most urgent landmark is chosen as the next desired subgoal. The left-most part visually compares our high-return (HR) sampling with the commonly used uniform sampling in landmark-based planning. Remarkably, our graph contracts rapidly upon encountering high-value regions, similar to the behavior observed in the slime mold maze experiment [nakagaki2000maze]. The maze environment depicted in this diagram is discussed in detail in \ref{embossed_u_maze}.
  • Figure 3: Environments used in our experiments. The desired goal in each task is marked with a red arrow, and the black line represents a possible trajectory to reach the goal.
  • Figure 4: Ablation studies on hyperparameter selection in the Ant Push (first row) and Ant Maze (U-shape) (second row) tasks. The first, second, and third columns illustrate the learning curves of the proposed framework with varying numbers of landmarks $|\mathcal{S}_{LM}^{\rm cov} \cup \mathcal{S}_{LM}^{\rm nov}|$, varying balancing coefficient $\lambda^{\rm ACLG}_{\rm LM}$, and varying penalty coefficient $\lambda_{gp}$, respectively. The success rate is averaged over five random seeds.
  • Figure 5: The first column illustrates the performance of our method with varying temperature $\alpha$ in the \ref{fig:antpush_wa} Ant Push and \ref{fig:antmaze_wa} Ant Maze (U-shape) tasks. The remaining columns visualize the sampled landmarks, providing a qualitative analysis of the weighted sampling strategy at different temperatures $\alpha$. Here, we sample 1000 previously visited states (yellow balls), sparsify them to obtain 60 coverage-based landmarks (pink balls), and additionally sample 60 novelty-based landmarks (red balls).
  • ...and 10 more figures
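
The landmark pipeline described in the Figure 5 caption (sample 1000 visited states with a temperature-controlled weighting, then sparsify to 60 coverage-based landmarks) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the softmax weighting `exp(return / alpha)`, the use of farthest point sampling for sparsification, the toy 2-D buffer, and every function and variable name here are hypothetical. Novelty-based landmarks are omitted because this overview does not specify the novelty measure.

```python
import math
import random

def high_return_sample(states, returns, k, alpha, rng):
    """Draw k states (with replacement), weighted by exp(return / alpha)."""
    m = max(returns)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / alpha) for r in returns]
    return rng.choices(states, weights=weights, k=k)

def farthest_point_sampling(points, k):
    """Greedily keep k mutually distant points as coverage landmarks."""
    chosen = [points[0]]
    dist = [math.dist(p, chosen[0]) for p in points]
    while len(chosen) < k:
        idx = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(points[idx])
        # Track each point's distance to its nearest chosen landmark.
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return chosen

rng = random.Random(0)
# Toy replay buffer (assumption): 2-D positions whose return grows toward (10, 10).
buffer_states = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(5000)]
buffer_returns = [-math.dist(s, (10.0, 10.0)) for s in buffer_states]

sampled = high_return_sample(buffer_states, buffer_returns, k=1000, alpha=1.0, rng=rng)
landmarks = farthest_point_sampling(sampled, k=60)
```

In this sketch a small `alpha` concentrates the sampled states, and hence the landmarks, in high-return regions, while a large `alpha` approaches uniform sampling over the buffer.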

Theorems & Definitions (6)

  • Proposition 1: Q-function Lipschitzness
  • proof
  • Corollary 1: Lower-level Q-function Lipschitzness
  • proof
  • Corollary 1: Lower-level Q-function Lipschitzness
  • proof