Globally Optimal Hierarchical Reinforcement Learning for Linearly-Solvable Markov Decision Processes

Guillermo Infante, Anders Jonsson, Vicenç Gómez

TL;DR

This work extends linearly-solvable MDPs to a hierarchical setting by partitioning the state space into subtasks and leveraging subtask compositionality. The central idea is to express any subtask’s value function as a linear combination of base LMDPs, enabling zero-shot updates and a globally optimal policy without high-level non-stationarity. The paper introduces equivalence classes of subtasks, intra-task learning to share updates across base LMDPs, and an eigenvector-based solution for exit states, all while maintaining convergence guarantees under mild assumptions. The approach yields substantial sample efficiency by reducing problem size through hierarchy and compositionality, with theoretical guarantees and empirical validation in grid-world and Taxi-like domains. Overall, it provides a principled framework for globally optimal, hierarchically structured decision making in LMDPs, with practical benefits for transfer and scalability.

Abstract

In this work we present a novel approach to hierarchical reinforcement learning for linearly-solvable Markov decision processes. Our approach assumes that the state space is partitioned, and the subtasks consist in moving between the partitions. We represent value functions on several levels of abstraction, and use the compositionality of subtasks to estimate the optimal values of the states in each partition. The policy is implicitly defined on these optimal value estimates, rather than being decomposed among the subtasks. As a consequence, our approach can learn the globally optimal policy, and does not suffer from the non-stationarity of high-level decisions. If several partitions have equivalent dynamics, the subtasks of those partitions can be shared. If the set of boundary states is smaller than the entire state space, our approach can have significantly smaller sample complexity than that of a flat learner, and we validate this empirically in several experiments.
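The compositionality mentioned above can be checked numerically on a toy example. The following is a minimal sketch, not the authors' implementation. It assumes the standard first-exit LMDP equation $z(s)=e^{R(s)/\lambda}\sum_{s'}P(s'|s)\,z(s')$ for non-terminal states, and it assumes that each base LMDP fixes $z=1$ at one terminal state and $z=0$ at all others. The corridor layout, the reward of $-1$ per step, and all names (`solve_subtask`, `base`, etc.) are hypothetical choices made for illustration.

```python
import numpy as np

lam = 1.0                       # temperature (assumed value)
n = 7                           # toy corridor: states 0..6
interior = np.arange(1, 6)      # non-terminal states of the subtask
terminal = np.array([0, 6])     # its two exit (terminal) states

# Passive dynamics: unbiased random walk on the interior states.
P = np.zeros((n, n))
for s in interior:
    P[s, s - 1] = P[s, s + 1] = 0.5

R = np.full(n, -1.0)            # assumed reward of -1 per non-terminal step

# Linear system (I - G P_NN) z_N = G P_NT z_T, with G = diag(exp(R / lam)).
G = np.diag(np.exp(R[interior] / lam))
A = np.eye(len(interior)) - G @ P[np.ix_(interior, interior)]
B = G @ P[np.ix_(interior, terminal)]

def solve_subtask(z_terminal):
    """Return z on the interior states for given terminal values z_T."""
    return np.linalg.solve(A, B @ z_terminal)

# One base LMDP per terminal state: z = 1 at that exit, z = 0 at the others.
base = np.column_stack([solve_subtask(e) for e in np.eye(len(terminal))])

# Arbitrary terminal values: the composed solution needs no further solving.
z_terminal = np.array([0.3, 0.8])
z_composed = base @ z_terminal          # linear combination of base solutions
z_direct = solve_subtask(z_terminal)    # solved from scratch for comparison

print(np.allclose(z_composed, z_direct))
```

Because the equation is linear in $z$, the base solutions only need to be computed once; any new set of terminal values then yields the subtask solution through a single matrix-vector product.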

Paper Structure

This paper contains 11 sections, 3 theorems, 14 equations, 3 figures, and 1 algorithm.

Key Result

Lemma 2

If the reward of each terminal state $\tau\in\mathcal{T}_i$ equals its optimal value in $\mathcal{L}$, i.e. $z_i(\tau)=z(\tau)$, then the optimal value of each non-terminal state $s\in\mathcal{S}_i$ also equals its optimal value in $\mathcal{L}$, i.e. $z_i(s)=z(s)$.
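The statement of Lemma 2 can be sanity-checked on a small example. The sketch below is an illustration under stated assumptions rather than the paper's algorithm: it assumes the first-exit LMDP equation $z(s)=e^{R(s)/\lambda}\sum_{s'}P(s'|s)\,z(s')$, uses a hypothetical 10-state chain as the full LMDP $\mathcal{L}$, and picks the partition $\mathcal{S}_i=\{3,4,5\}$ with exit states $\mathcal{T}_i=\{2,6\}$. The helper `solve_lmdp` and all variable names are made up for this example.

```python
import numpy as np

def solve_lmdp(P, R, interior, terminal, z_terminal, lam=1.0):
    """Solve z = exp(R/lam) * (P z) on `interior`, with z fixed on `terminal`."""
    G = np.diag(np.exp(R[interior] / lam))
    P_NN = P[np.ix_(interior, interior)]
    P_NT = P[np.ix_(interior, terminal)]
    A = np.eye(len(interior)) - G @ P_NN
    return np.linalg.solve(A, G @ P_NT @ z_terminal)

# Full LMDP L: chain of states 0..9, state 9 is the only terminal state.
n = 10
P = np.zeros((n, n))
P[0, 0] = P[0, 1] = 0.5                 # reflecting left end
for s in range(1, 9):
    P[s, s - 1] = P[s, s + 1] = 0.5     # unbiased random walk
R = np.full(n, -1.0)                    # assumed reward of -1 per step
R[9] = 0.0                              # terminal reward, so z(9) = 1

z_full = np.zeros(n)
z_full[9] = np.exp(R[9])
z_full[:9] = solve_lmdp(P, R, np.arange(9), np.array([9]), z_full[[9]])

# Subtask i: partition S_i = {3, 4, 5} with exit states T_i = {2, 6}.
interior_i = np.array([3, 4, 5])
terminal_i = np.array([2, 6])

# Premise of Lemma 2: terminal values of the subtask equal those in L.
z_sub = solve_lmdp(P, R, interior_i, terminal_i, z_full[terminal_i])

# Conclusion of Lemma 2: interior values of the subtask match those in L.
print(np.allclose(z_sub, z_full[interior_i]))
```

The check succeeds because the subtask's equations are the full problem's equations restricted to $\mathcal{S}_i$, with the exit-state values supplied as boundary conditions.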

Figures (3)

  • Figure 1: a) A 4-room LMDP, with a terminal state $F$ and 8 other exit states; b) a single subtask with 5 terminal states $F,L,R,T,B$ that is equivalent to all 4 room subtasks. Rooms are numbered 1 through 4, left-to-right, then top-to-bottom, and exit state $1^B$ refers to the exit $B$ of room $1$, etc.
  • Figure 2: Results for $3\times 3$ rooms of size $5 \times 5$ (left); $5\times 5$ rooms of size $3 \times 3$ (center); $8 \times 8$ rooms of size $5\times 5$ (right).
  • Figure 3: Results for the Taxi domain on $5 \times 5$ and $10 \times 10$ grids, respectively.

Theorems & Definitions (4)

  • Definition 1
  • Lemma 2
  • Lemma 3
  • Lemma 4