Feudal Graph Reinforcement Learning

Tommaso Marzi, Arshjot Khehra, Andrea Cini, Cesare Alippi

TL;DR

Feudal Graph Reinforcement Learning introduces a pyramidal, graph-based hierarchy that learns modular policies across multiple levels of abstraction to address coordination bottlenecks in graph-based RL. By combining bottom-up state encoding with top-down goal propagation within a multi-level graph, workers, sub-managers, and a manager collaboratively decompose tasks and align local actions with global objectives. Empirical results on graph clustering and MuJoCo locomotion show competitive performance, with analysis of generated goals revealing coherent temporal structure that supports long-range coordination. This approach demonstrates a principled way to embed spatiotemporal abstraction into graph-based RL, enhancing transferability and scalability for structured, composable control problems.

Abstract

Graph-based representations and message-passing modular policies constitute prominent approaches to tackling composable control problems in reinforcement learning (RL). However, as shown by recent graph deep learning literature, such local message-passing operators can create information bottlenecks and hinder global coordination. The issue becomes more serious in tasks requiring high-level planning. In this work, we propose a novel methodology, named Feudal Graph Reinforcement Learning (FGRL), that addresses such challenges by relying on hierarchical RL and a pyramidal message-passing architecture. In particular, FGRL defines a hierarchy of policies where high-level commands are propagated from the top of the hierarchy down through a layered graph structure. The bottom layers mimic the morphology of the physical system, while the upper layers correspond to higher-order sub-modules. The resulting agents are then characterized by a committee of policies where actions at a certain level set goals for the level below, thus implementing a hierarchical decision-making structure that can naturally implement task decomposition. We evaluate the proposed framework on a graph clustering problem and MuJoCo locomotion tasks; simulation results show that FGRL compares favorably against relevant baselines. Furthermore, an in-depth analysis of the command propagation mechanism provides evidence that the introduced message-passing scheme favors learning hierarchical decision-making policies.
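
The two information flows described above (state encoding from the bottom up, goals assigned from the top down) can be illustrated with a toy sketch. This is not the authors' implementation: the hierarchy, the node names, and the arithmetic rules standing in for the learned encoders and policies are all assumptions chosen only to make the message-passing order concrete.

```python
from typing import Dict, List, Tuple

# Hypothetical three-level hierarchy (names are illustrative, not from the paper):
# one manager, two sub-managers, four workers (leaves).
HIERARCHY: Dict[str, List[str]] = {
    "manager": ["sub_left", "sub_right"],
    "sub_left": ["thigh_l", "leg_l"],
    "sub_right": ["thigh_r", "leg_r"],
}


def encode_bottom_up(node: str, obs: Dict[str, float],
                     states: Dict[str, float]) -> float:
    """Bottom-up pass: a worker's state is its own observation; a parent's
    state is the mean of its children's states (placeholder for a learned encoder)."""
    children = HIERARCHY.get(node, [])
    if not children:
        states[node] = obs[node]
    else:
        child_states = [encode_bottom_up(c, obs, states) for c in children]
        states[node] = sum(child_states) / len(child_states)
    return states[node]


def propagate_top_down(node: str, goal: float, states: Dict[str, float],
                       goals: Dict[str, float], actions: Dict[str, float]) -> None:
    """Top-down pass: each non-leaf node turns its state and received goal into
    a command that becomes the goal of every child; workers emit actions."""
    goals[node] = goal
    children = HIERARCHY.get(node, [])
    if not children:
        # Placeholder worker policy: act to close the gap between goal and state.
        actions[node] = goal - states[node]
        return
    # Placeholder manager/sub-manager policy (assumption, not a learned function).
    command = states[node] + goal
    for child in children:
        propagate_top_down(child, command, states, goals, actions)


def run_hierarchy(obs: Dict[str, float], root: str = "manager"
                  ) -> Tuple[Dict[str, float], Dict[str, float], Dict[str, float]]:
    """Run one bottom-up encoding pass followed by one top-down goal pass."""
    states: Dict[str, float] = {}
    goals: Dict[str, float] = {}
    actions: Dict[str, float] = {}
    encode_bottom_up(root, obs, states)
    # The root receives no goal from above; 0.0 is used as a neutral stand-in.
    propagate_top_down(root, 0.0, states, goals, actions)
    return states, goals, actions
```

Calling `run_hierarchy` with one scalar observation per worker returns the per-node states, the goal each node received, and the worker actions. In an actual agent the placeholder rules would be replaced by trainable functions and the scalars by feature vectors.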

Paper Structure (37 sections, 13 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Constructing the agent graph $\mathcal{G}_1$ for the 'Humanoid' environment. Blue squares in the agent's morphology mark the joints, which are not mapped to nodes; green labels mark the limbs, which constitute the nodes of $\mathcal{G}_1$.
  • Figure 2: Learning architecture given the hierarchical graph $\mathcal{G}^*$ and the graphs $\mathcal{G}_{l_h}$ for the 'Walker' environment. Trainable functions are shown in red and hierarchical operations are represented with dashed lines: in $\mathcal{G}^*$, information flows bottom-up, while goals are assigned top-down.
  • Figure 3: Percentage of correct clustering (color) and median of score (value) over $4$ independent runs. For a given configuration $(\beta, N_\beta)$, all models are trained on the same topology to ensure fairness.
  • Figure 4: Percentage of correct clustering (color) and median of score (value) over $4$ independent runs with long-term and one-step goals. For a given configuration $(\beta, N_\beta)$, all models are trained on the same topology to ensure fairness.
  • Figure 5: Average return and standard deviation of the considered agents on the MuJoCo benchmarks (averaged over $4$ runs). Each generation refers to a population of $64$ episodes. To ease visualization, the plots show a running average of the returns, normalized w.r.t. the maximum obtained values: 4025 (Humanoid), 2817 (Walker), 1918 (Half Cheetah), and 3175 (Hopper).
  • ...and 5 more figures
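
The smoothing described in the Figure 5 caption (returns normalized by the maximum obtained value, then a running average) could be computed along the following lines. This is a sketch under stated assumptions: the caption does not give the window length or the exact averaging scheme, so the `window` parameter and the trailing-window mean used here are hypothetical choices.

```python
from typing import List


def normalized_running_average(returns: List[float], max_return: float,
                               window: int) -> List[float]:
    """Normalize returns by `max_return`, then apply a trailing running mean.

    `window` is an assumed parameter; the first few points average over
    however many values are available so the output has the same length.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if max_return <= 0:
        raise ValueError("max_return must be positive")
    normalized = [r / max_return for r in returns]
    smoothed: List[float] = []
    for i in range(len(normalized)):
        start = max(0, i - window + 1)
        chunk = normalized[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

For the Humanoid curves, for example, `max_return` would be the reported maximum of 4025, so that a normalized value of 1.0 corresponds to the best return obtained.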