Temporal Abstraction in Reinforcement Learning with Offline Data

Ranga Shaarad Ayyagari, Anurita Ghosh, Ambedkar Dukkipati

TL;DR

The paper tackles the problem of learning temporally abstracted policies from offline data for long-horizon RL. It introduces a general framework that lets an online hierarchical RL (HRL) algorithm train entirely offline: the framework builds a pessimistic MDP ($P$-MDP) from the dataset and constrains the latent action space with a state-conditioned CVAE. The authors validate the approach on MuJoCo locomotion tasks and robotic gripper block stacking across standard, transfer, and goal-conditioned settings, reporting performance competitive with or superior to existing offline methods such as MOReL, MOPO, and CQL, especially in transfer and goal-directed tasks. Ablation studies indicate that the CVAE constraint, pessimistic termination, and goal-conditioning are each needed for robust offline hierarchical planning.
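The summary above does not spell out how the pessimistic MDP is built, so the following is only a minimal sketch of one common construction, in the style of MOReL: an ensemble of learned dynamics models flags state-action pairs where its members disagree, and a rollout that reaches such a pair is terminated with a penalty. The class name `PessimisticMDP`, the toy linear models, and the `threshold` and `penalty` values are all hypothetical and chosen for illustration; the paper's actual models and hyperparameters may differ.

```python
import math
from itertools import combinations


class PessimisticMDP:
    """Hypothetical sketch of a pessimistic MDP built from learned models.

    Assumption: uncertainty is measured by ensemble disagreement, as in
    MOReL-style constructions. Not the authors' implementation.
    """

    def __init__(self, models, reward_fn, threshold, penalty):
        self.models = models        # callables: (state, action) -> next state
        self.reward_fn = reward_fn  # callable: (state, action) -> float
        self.threshold = threshold  # largest tolerated ensemble disagreement
        self.penalty = penalty      # reward assigned on pessimistic termination

    def disagreement(self, state, action):
        # Largest pairwise distance between the ensemble's predictions.
        preds = [model(state, action) for model in self.models]
        return max(math.dist(p, q) for p, q in combinations(preds, 2))

    def step(self, state, action):
        if self.disagreement(state, action) > self.threshold:
            # The data does not cover this region: end the rollout with a penalty.
            return state, self.penalty, True
        next_state = self.models[0](state, action)
        return next_state, self.reward_fn(state, action), False


def make_model(scale):
    # Toy dynamics: the models agree for small actions and diverge for large ones.
    return lambda s, a: [si + scale * ai for si, ai in zip(s, a)]


env = PessimisticMDP(
    models=[make_model(1.0), make_model(1.1)],
    reward_fn=lambda s, a: 1.0,
    threshold=0.05,
    penalty=-100.0,
)
start = [0.0, 0.0]
known = env.step(start, [0.1, 0.0])    # small action: models agree
unknown = env.step(start, [5.0, 0.0])  # large action: models disagree
```

In this toy setup the small action yields an ordinary transition, while the large action triggers termination with the penalty. An agent trained inside such a wrapper is discouraged from leaving the region that the offline data supports.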

Abstract

Standard reinforcement learning algorithms with a single policy perform poorly on tasks in complex environments involving sparse rewards, diverse behaviors, or long-term planning. This led to the study of algorithms that incorporate temporal abstraction by training a hierarchy of policies that plan over different time scales. The options framework was introduced to implement such temporal abstraction by learning low-level options that act as extended actions controlled by a high-level policy. The main challenge in applying these algorithms to real-world problems is that they suffer from high sample complexity when training multiple levels of the hierarchy, which makes them impossible to train in online settings. Motivated by this, in this paper, we propose an offline hierarchical RL method that can learn options from existing offline datasets collected by other unknown agents. This is a very challenging problem due to the distribution mismatch between the learned options and the policies responsible for the offline dataset, and to our knowledge, this is the first work in this direction. Specifically, we propose a framework by which an online hierarchical reinforcement learning algorithm can be trained on an offline dataset of transitions collected by an unknown behavior policy. We validate our method on Gym MuJoCo locomotion environments and robotic gripper block-stacking tasks in the standard as well as transfer and goal-conditioned settings.

Paper Structure (22 sections, 1 equation, 7 figures, 8 tables)

Figures (7)

  • Figure 1: The three high-level goals in the robotics environment: grasping the blue block, placing it on the red block, and returning to the end position, respectively. These actions have to be taken sequentially, necessitating a hierarchical agent with a high-level planner.
  • Figure 2: Meta-algorithm for learning an offline version of a hierarchical online algorithm. The CVAE and the environment models are learned from the offline dataset. These are then used to train the HRL algorithm, which operates in a pessimistic approximation of the actual environment and the latent action space of the CVAE.
  • Figure 3: First row: results in Gym MuJoCo locomotion environments in the standard offline reinforcement learning setting. The captions specify the environment and the offline dataset on which the algorithms are trained. All three algorithms are trained on the CVAE + P-MDP framework. Second row: results for the transfer task in Gym MuJoCo environments. The dip in the reward corresponds to the start of online training on the different tasks.
  • Figure 4: Results on the robotic-gripper block-stacking task for UOF along with SAC-HER and BC baselines, trained on the CVAE + P-MDP framework. The first and second rows depict the performance when trained on the Medium and Medium-Expert datasets respectively. Each plot shows the fraction of times the agents were able to reach the corresponding goal.
  • Figure 5: Results for the ablation experiments. The first row shows the ablation results of the UOF algorithm in the block-stacking task with and without a CVAE. The second row shows the results of the ablation of the pessimistic termination and the CVAE in the Hopper environment with the Medium offline dataset.
  • ...and 2 more figures