Offline Hierarchical Reinforcement Learning via Inverse Optimization

Carolin Schmidt, Daniele Gammelli, James Harrison, Marco Pavone, Filipe Rodrigues

TL;DR

The paper proposes OHIO, a framework for offline reinforcement learning (RL) of hierarchical policies. OHIO uses knowledge of the policy structure to solve an inverse problem, recovering the unobservable high-level actions that likely generated the observed data under the authors' hierarchical policy.

Abstract

Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a significant challenge. Crucially, actions taken by higher-level policies may not be directly observable within hierarchical controllers, and the offline dataset might have been generated using a different policy structure, hindering the use of standard offline learning algorithms. In this work, we propose OHIO: a framework for offline reinforcement learning (RL) of hierarchical policies. Our framework leverages knowledge of the policy structure to solve the inverse problem, recovering the unobservable high-level actions that likely generated the observed data under our hierarchical policy. This approach constructs a dataset suitable for off-the-shelf offline training. We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed. Code and data are available at https://ohio-offline-hierarchical-rl.github.io
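
The relabeling idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the linear-feedback low-level controller, the gain K, the toy dynamics, the zero rewards, and the function names low_level_action, recover_high_level_action, and relabel are all assumptions made for illustration. Under those assumptions, the hidden high-level action (a goal state) is recovered from each observed state and low-level action, and the logged transitions are rewritten so that a standard offline RL algorithm could train on them.

```python
import numpy as np

# Hypothetical low-level controller (an assumption for illustration):
# linear feedback that drives the state x toward a high-level goal g.
K = np.array([[2.0, 0.5],
              [0.0, 1.5]])  # feedback gain, chosen to be invertible


def low_level_action(x, g):
    """Low-level action produced for state x and high-level action g."""
    return -K @ (x - g)


def recover_high_level_action(x, u):
    """Solve the inverse problem: find g such that -K (x - g) matches u.

    Rearranging u = -K (x - g) gives K g = u + K x, solved here in the
    least-squares sense so the same code also covers non-square gains.
    """
    g, *_ = np.linalg.lstsq(K, u + K @ x, rcond=None)
    return g


def relabel(dataset):
    """Replace each observed low-level action with the recovered high-level one."""
    return [(x, recover_high_level_action(x, u), r, x_next)
            for (x, u, r, x_next) in dataset]


# Synthetic logged data: the goals are hidden from the offline dataset.
rng = np.random.default_rng(0)
hidden_goals = rng.normal(size=(5, 2))
states = rng.normal(size=(5, 2))

dataset = []
for x, g in zip(states, hidden_goals):
    u = low_level_action(x, g)
    x_next = x + u  # toy dynamics, assumed for illustration only
    dataset.append((x, u, 0.0, x_next))  # reward 0.0 is a placeholder

relabeled = relabel(dataset)
recovered_goals = np.array([g for (_, g, _, _) in relabeled])
print(np.allclose(recovered_goals, hidden_goals))
```

In this sketch the inverse is exact because the assumed controller is linear and K is invertible. With a nonlinear or constrained low-level policy, the same step would instead require a numerical optimization over candidate high-level actions.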

Paper Structure

This paper contains 81 sections, 27 equations, 9 figures, 14 tables, 3 algorithms.

Figures (9)

  • Figure 1: We propose OHIO, a framework to learn hierarchical policies from offline data. By exploiting structural knowledge of the low-level policy, we solve an inverse problem (top center) to transform low-level trajectory data (top left) into a dataset amenable to offline RL (top right), regardless of the nature of the policy used for data collection. At inference time, the RL-trained policy provides inputs to the low-level policy (bottom).
  • Figure 2: Supply chain fine-tuning performance of OHIO (FT-OHIO) and end-to-end (FT-E2E) policies pre-trained on (a) sub-optimal (i.e., HEUR) and (b) biased data (i.e., MPC with biased forecast).
  • Figure 3: A graphical model for the system evolution, assuming Markovian dynamics and policies.
  • Figure 4: A diagram showing our inference procedure. Left: the observed-action case. Right: the case in which low-level actions are not observed.
  • Figure 5: Visualization of high-level actions recovered by the analytical inverse for different LQR parameter settings.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Example 3.1: Linear-quadratic-Gaussian
  • Example 3.2: Analytical inverse: solving the inverse linear-quadratic problem