Convergence and stability of Q-learning in Hierarchical Reinforcement Learning

Massimiliano Manenti, Andrea Iannelli

TL;DR

This work addresses the lack of theoretical guarantees for hierarchical reinforcement learning by proposing Feudal Q-learning and proving its convergence and stability using Two-Timescale Stochastic Approximation and the ODE method. By modeling the high- and low-level updates as coupled dynamical systems, it shows that the learned Q-functions $(Q^{\mathrm{h}}, Q^{\mathrm{l}})$ converge almost surely to the Bellman-equilibrium pair $(Q^{\mathrm{h},*}, Q^{\mathrm{l},*})$ and remain bounded, even under interdependent updates. The authors also reveal a game-theoretic interpretation, showing the equilibrium aligns with Nash and Stackelberg concepts, which opens doors to further HRL design via game theory. Numerical experiments in Four Rooms MiniGrid corroborate the theory and demonstrate continual learning benefits, with Feudal Q-learning achieving comparable performance to flat Q-learning while accelerating adaptation to new goals.

Abstract

Hierarchical Reinforcement Learning promises, among other benefits, to efficiently capture and utilize the temporal structure of a decision-making problem and to enhance continual learning capabilities, but theoretical guarantees lag behind practice. In this paper, we propose a Feudal Q-learning scheme and investigate under which conditions its coupled updates converge and are stable. By leveraging the theory of Stochastic Approximation and the ODE method, we present a theorem stating the convergence and stability properties of Feudal Q-learning. This provides a principled convergence and stability analysis tailored to Feudal RL. Moreover, we show that the updates converge to a point that can be interpreted as an equilibrium of a suitably defined game, opening the door to game-theoretic approaches to Hierarchical RL. Lastly, experiments based on the Feudal Q-learning algorithm support the outcomes anticipated by theory.
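The two-timescale idea behind the analysis can be illustrated on a toy problem. The sketch below is not the paper's Feudal Q-learning algorithm: the drift functions `h` and `g`, the step-size exponents, and the noise level are all hypothetical choices made for illustration. It only shows the mechanism the analysis relies on: two coupled noisy recursions, one run with a larger ("fast") step size and one with a smaller ("slow") step size whose ratio to the fast one vanishes, so that the fast iterate tracks its equilibrium for the current value of the slow one.

```python
import random


def step_sizes(n):
    """Return (fast, slow) step sizes with slow / fast -> 0 as n grows.

    The exponents are illustrative assumptions, not values from the paper.
    """
    fast = 1.0 / (n + 1) ** 0.6  # a(n): sum diverges, sum of squares converges
    slow = 1.0 / (n + 1)         # b(n): b(n) / a(n) = (n + 1) ** -0.4 -> 0
    return fast, slow


def run_two_timescale(num_steps=200_000, noise_std=0.1, seed=0):
    """Run a toy pair of coupled stochastic-approximation recursions.

    Hypothetical drifts:
      h(x, y) = 0.5 * y - x   -> for frozen y, x is attracted to lambda(y) = 0.5 * y
      g(x, y) = 1 - y - x     -> with x = lambda(y), y is attracted to y* = 2/3
    so the coupled equilibrium is (x*, y*) = (1/3, 2/3).
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for n in range(num_steps):
        a_n, b_n = step_sizes(n)
        h = 0.5 * y - x
        g = 1.0 - y - x
        # Both iterates are updated from the same (x, y) pair, with zero-mean noise.
        x, y = (
            x + a_n * (h + rng.gauss(0.0, noise_std)),
            y + b_n * (g + rng.gauss(0.0, noise_std)),
        )
    return x, y


if __name__ == "__main__":
    x_final, y_final = run_two_timescale()
    print(f"x = {x_final:.4f} (target 1/3), y = {y_final:.4f} (target 2/3)")
```

In this toy, neither iterate is assigned to the high or the low level of the hierarchy; which level runs on which timescale is specified in the paper itself.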

Paper Structure

This paper contains 26 sections, 8 theorems, 61 equations, 4 figures, 1 table.

Key Result

Lemma 1

The ODE (eq:3_ODE_h) has a globally asymptotically stable equilibrium $\lambda(y)$. Moreover, $\lambda$ is a Lipschitz map.
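For context, the lemma has the shape of a standard assumption in two-timescale stochastic approximation. The generic template is sketched below; the symbols $x_n$, $y_n$, $a(n)$, $b(n)$ and the noise terms $M^{(1)}_{n+1}$, $M^{(2)}_{n+1}$ are assumptions taken from the textbook presentation, not from the paper, and the paper's own ODE (eq:3_ODE_h) is not reproduced here. Only $h$, $g$ and $\lambda$ appear in the lemma and theorem list of this page.

$$
x_{n+1} = x_n + a(n)\left[h(x_n, y_n) + M^{(1)}_{n+1}\right], \qquad
y_{n+1} = y_n + b(n)\left[g(x_n, y_n) + M^{(2)}_{n+1}\right], \qquad
\frac{b(n)}{a(n)} \to 0 .
$$

Under this template, the fast iterate sees $y$ as frozen and follows $\dot{x}(t) = h(x(t), y)$, whose globally asymptotically stable equilibrium is $\lambda(y)$. The slow iterate then follows the reduced ODE

$$
\dot{y}(t) = g\big(\lambda(y(t)),\, y(t)\big),
$$

and if this ODE has a globally asymptotically stable equilibrium $y^*$, the pair $(x_n, y_n)$ converges almost surely to $(\lambda(y^*), y^*)$, provided the iterates stay bounded and the usual step-size and noise conditions hold. The Lipschitz property of $\lambda$ is what keeps the reduced ODE well posed.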

Figures (4)

  • Figure 1: Block diagram of the Feudal RL scheme. Instead of having a flat policy interacting with the system, there is a high-level policy, $\pi^\mathrm{h}$, selecting goals for a low-level policy, $\pi^\mathrm{l}$, that interacts with the system $P$.
  • Figure 2: Time scales of the high-level and low-level MDPs. Circles represent the states of the real system, while black circles indicate the states where the high-level MDP makes a decision. In this example, the high-level MDP makes a decision every $T=3$ time steps of the low-level MDP.
  • Figure 3: First room of the Four Rooms Minigrid environment. The red triangle represents the initial pose of the agent, while the green square is the tile to reach. Grey tiles represent walls, while black tiles are walkable areas. Lit tiles help recognize the direction the agent is facing.
  • Figure 4: Cumulative reward per episode in the two training runs. The experiment clearly shows the convergent behaviour predicted by Theorem 1. Moreover, the HRL agent converges to the same cumulative reward as the standard Q-learning agent.
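The timing described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under two assumptions that the caption does not state: the first high-level decision happens at step 0, and the low-level policy acts at every step, including the steps where the high level decides. The function names are hypothetical.

```python
def high_level_decision_steps(horizon, period=3):
    """Low-level time steps at which the high-level MDP makes a decision."""
    return [t for t in range(horizon) if t % period == 0]


def label_steps(horizon, period=3):
    """Label each low-level step by which levels act at that step."""
    return ["high+low" if t % period == 0 else "low" for t in range(horizon)]


if __name__ == "__main__":
    print(high_level_decision_steps(10))  # [0, 3, 6, 9]
    print(label_steps(4))                 # ['high+low', 'low', 'low', 'high+low']
```

With `period=3`, the steps flagged `high+low` correspond to the black circles in the figure.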

Theorems & Definitions (16)

  • Lemma 1: Existence and property of the globally asymptotically stable equilibrium of $h$
  • Lemma 2: Existence of the globally asymptotically stable equilibrium of $g$
  • Theorem 1: Convergence and stability of Feudal Q-learning
  • proof
  • Definition 1: Nash equilibrium
  • Definition 2: Stackelberg equilibrium
  • Lemma 3
  • proof
  • Lemma 4: Lipschitz continuity of $g$ and $h$
  • proof
  • ...and 6 more