Hierarchical Average-Reward Linearly-solvable Markov Decision Processes

Guillermo Infante; Anders Jonsson; Vicenç Gómez

TL;DR

The paper addresses hierarchical reinforcement learning in the average-reward setting for Linearly-solvable MDPs (LMDPs), showing how to learn low-level and high-level tasks jointly without restrictive assumptions. It develops a hierarchical ALMDP framework built on state-space partitioning, subtask equivalence, and compositional value representation, enabling exact construction of the high-level value function from base subtasks. Two algorithms are proposed: a two-stage eigenvector method and an online algorithm that learns gains, subtasks, and exit values from samples; both leverage the linear structure of LMDPs and compositionality. Empirical results on N-room and Taxi domains demonstrate substantial speedups over flat ALMDP learning, validating the approach's efficiency and scalability for hierarchical average-reward tasks.

Abstract

We introduce a novel approach to hierarchical reinforcement learning for Linearly-solvable Markov Decision Processes (LMDPs) in the infinite-horizon average-reward setting. Unlike previous work, our approach allows learning low-level and high-level tasks simultaneously, without imposing limiting restrictions on the low-level tasks. Our method relies on partitions of the state space that create smaller subtasks that are easier to solve, and it exploits the equivalence between such partitions to learn more efficiently. We then exploit the compositionality of low-level tasks to exactly represent the value function of the high-level task. Experiments show that our approach can outperform flat average-reward reinforcement learning by one or several orders of magnitude.

Paper Structure (17 sections, 12 theorems, 52 equations, 3 figures, 2 algorithms)

This paper contains 17 sections, 12 theorems, 52 equations, 3 figures, 2 algorithms.

Related work
Background
First-exit Linearly-solvable Markov Decision Processes
Hierarchical Decomposition for LMDPs
Average-reward Linearly-solvable Markov Decision Processes
Solving an ALMDP
Hierarchical Average-Reward LMDPs
Hierarchical Decomposition
Subtask Compositionality
Efficiency of the value representation
Algorithms
Eigenvector approach
Online algorithm
Experiments
Conclusion
...and 2 more sections

Key Result

Theorem 3

Under mild assumptions, differential soft TD-learning with the update rules for $v$ and $\rho$ (eq:main_v_td_update and eq:main_rho_td_update) converges to the optimal values of $v$ and $\rho$ in $\mathcal{L}$.
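
For intuition about the two quantities named in the theorem, the sketch below computes the gain $\rho$ and the value function $v$ of a small flat ALMDP by power iteration. It is not the paper's TD update and not its hierarchical two-stage method. It assumes the standard exponentiated Bellman equation for average-reward LMDPs, $e^{\eta\rho} z(s) = e^{\eta R(s)} \sum_{s'} P(s'|s)\, z(s')$ with $z = e^{\eta v}$, and ergodic, aperiodic passive dynamics $P$. The function name `solve_flat_almdp`, the temperature parameter `eta`, and the 3-state example are hypothetical and chosen only for illustration.

```python
import math

def solve_flat_almdp(P, R, eta=1.0, iters=10_000, tol=1e-12):
    """Power iteration on G = diag(exp(eta * R)) P (assumed formulation).

    P: row-stochastic passive dynamics as a list of lists.
    R: per-state rewards.
    Returns (rho, v), with v normalised so that v[0] == 0.
    """
    n = len(R)
    z = [1.0] * n          # z = exp(eta * v), kept normalised with z[0] == 1
    gamma = 1.0            # running estimate of the principal eigenvalue
    for _ in range(iters):
        # (G z)(s) = exp(eta * R(s)) * sum_s' P(s'|s) z(s')
        gz = [math.exp(eta * R[s]) * sum(P[s][t] * z[t] for t in range(n))
              for s in range(n)]
        gamma = gz[0]      # since z[0] == 1, (G z)(0) estimates the eigenvalue
        z_new = [x / gamma for x in gz]
        done = max(abs(a - b) for a, b in zip(z_new, z)) < tol
        z = z_new
        if done:
            break
    rho = math.log(gamma) / eta            # gain
    v = [math.log(x) / eta for x in z]     # differential value function
    return rho, v

# Hypothetical 3-state chain: lazy random walk, state 2 is the only unpenalised state.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
R = [-1.0, -1.0, 0.0]
rho, v = solve_flat_almdp(P, R)
```

Under the same assumptions, the optimal policy would reweight the passive dynamics as $\pi(s'|s) \propto P(s'|s)\, e^{\eta v(s')}$. A sample-based method such as the one in the theorem targets the same pair $(v, \rho)$ without access to $P$ and $R$ as explicit matrices.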

Figures (3)

Figure 1: a) An example $4$-room ALMDP; b) a single subtask with 5 terminal states $G,L,R,T,B$ that is equivalent to all 4 room subtasks. Rooms are numbered 1 through 4, left-to-right, then top-to-bottom, and exit state $1^B$ refers to the exit $B$ of room $1$, etc.
Figure 2: Results in N-room when varying the number of rooms and the size of the rooms.
Figure 3: Results for $5 \times 5$ (top) and $8 \times 8$ (bottom) grids of the Taxi domain.

Theorems & Definitions (23)

Definition 1
Theorem 3
Proof sketch
Theorem 4
Lemma 5
proof
Lemma 5
proof
Corollary 6
proof
...and 13 more
