A Provably Efficient Option-Based Algorithm for both High-Level and Low-Level Learning
Gianluca Drappo, Alberto Maria Metelli, Marcello Restelli
TL;DR
This work tackles the joint learning of high-level and low-level policies in option-based hierarchical reinforcement learning (HRL) for finite-horizon problems. It introduces Options-UCBVI (O-UCBVI) for the high level and a meta-algorithm, HLML, that alternates between high- and low-level regret minimization to cope with the non-stationarity induced by temporal abstraction. Both come with sublinear regret guarantees: O-UCBVI satisfies $Regret(\text{O-UCBVI},K) \le \tilde{O}( H \sqrt{SOKd} + H^3S^2Od + H\sqrt{Kd} )$, and HLML satisfies $R(\text{HLML},K) \le \tilde{O}( C^L H \sqrt{SOKd} + C^H H_O \sqrt{OSAKH_O} )$, under a structural assumption linking inner-option policies to flat-optimal policies. The results identify regimes where HRL is provably beneficial, notably when the product of the number of options and average duration reduces the effective planning horizon (i.e., $Od \ll AH$ and small $\sqrt{O\alpha^3}$ terms). Overall, the paper provides a principled, provably efficient framework for joint high- and low-level learning with temporally extended actions and clarifies the structural conditions under which HRL can outperform flat approaches.
Abstract
Hierarchical Reinforcement Learning (HRL) approaches have shown successful results in solving a large variety of complex, structured, long-horizon problems. Nevertheless, a full theoretical understanding of this empirical evidence is currently missing. In the context of the \emph{option} framework, prior research has devised efficient algorithms for scenarios where options are fixed and only the high-level policy selecting among options has to be learned. However, the fully realistic scenario in which both the high-level and the low-level policies are learned has been surprisingly disregarded from a theoretical perspective. This work takes a step toward understanding this latter scenario. Focusing on the finite-horizon problem, we present a meta-algorithm that alternates between regret minimization algorithms instantiated at different (high and low) levels of temporal abstraction. At the higher level, we treat the problem as a Semi-Markov Decision Process (SMDP) with fixed low-level policies, while at the lower level, inner option policies are learned with a fixed high-level policy. The derived bounds are compared with the lower bound for non-hierarchical finite-horizon problems, allowing us to characterize when a hierarchical approach is provably preferable, even without pre-trained options.
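To make the regime condition $Od \ll AH$ concrete, the following minimal sketch compares the leading term $H\sqrt{SOKd}$ of the O-UCBVI bound with a flat reference term. The flat reference $H\sqrt{SAKH}$ is an assumption made here for illustration (it is not spelled out in the text above); it is chosen so that the ratio of the two terms reduces to $\sqrt{Od/(AH)}$, which is small exactly when $Od \ll AH$. Constants, logarithmic factors, and lower-order terms are dropped, the function names are hypothetical, and $d$ is simply taken as the quantity appearing in the bound.

```python
import math


def o_ucbvi_leading_term(H, S, O, K, d):
    """Leading term H * sqrt(S * O * K * d) of the O-UCBVI bound.

    Constants, log factors, and lower-order terms are ignored.
    d is the quantity named d in the bound, taken as given.
    """
    return H * math.sqrt(S * O * K * d)


def flat_reference_term(H, S, A, K):
    """ASSUMED flat reference term H * sqrt(S * A * K * H).

    This expression is an illustrative assumption, not a formula
    quoted from the text; it is picked so that the ratio below
    equals sqrt(O * d / (A * H)).
    """
    return H * math.sqrt(S * A * K * H)


def leading_term_ratio(H, S, A, O, K, d):
    """Ratio of the hierarchical leading term to the flat reference.

    Algebraically this equals sqrt(O * d / (A * H)), so values well
    below 1 correspond to the regime O * d << A * H.
    """
    return o_ucbvi_leading_term(H, S, O, K, d) / flat_reference_term(H, S, A, K)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only to illustrate the regime.
    ratio = leading_term_ratio(H=100, S=50, A=10, O=4, K=10_000, d=10)
    print(f"hierarchical / flat leading-term ratio: {ratio:.3f}")
```

With these hypothetical sizes, $Od = 40$ and $AH = 1000$, so the ratio is $\sqrt{0.04}$, about 0.2. The comparison is only indicative: it ignores the $H^3S^2Od$ and $H\sqrt{Kd}$ terms of the O-UCBVI bound and says nothing about the low-level term in the HLML bound.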
