Learning Hidden Subgoals under Temporal Ordering Constraints in Reinforcement Learning

Duo Xu, Faramarz Fekri

TL;DR

We propose a novel RL algorithm that learns hidden subgoals (key states) and their temporal orderings at the same time, based on first-occupancy representation and temporal geometric sampling.

Abstract

In real-world applications, the success of a task often depends on multiple key steps that are far apart in time and must be achieved in a fixed order. For example, the key steps listed in a cooking recipe must be completed one by one in the right order. These key steps can be regarded as subgoals of the task, and their orderings can be described as temporal ordering constraints. However, in many real-world problems, subgoals (key states) are hidden in the state space and their temporal ordering constraints are unknown, which makes such tasks challenging for previous RL algorithms. To address this issue, we propose a novel RL algorithm for Learning hidden Subgoals under Temporal Ordering Constraints (LSTOC). We propose a new contrastive learning objective, based on first-occupancy representation and temporal geometric sampling, that learns hidden subgoals (key states) and their temporal orderings at the same time. In addition, we propose a sample-efficient learning strategy that discovers subgoals one by one following their temporal ordering constraints, building a subgoal tree to represent the discovered subgoals and their temporal ordering relationships. Specifically, this tree can be used to improve the sample efficiency of trajectory collection, speed up task solving, and generalize to unseen tasks. The LSTOC framework is evaluated on several environments with image-based observations, showing significant improvement over baseline methods.
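The abstract names two ingredients, first-occupancy representation and temporal geometric sampling, without giving their definitions here. The Python sketch below is only an illustration under stated assumptions, not the paper's actual objective: it assumes "first occupancy" means keeping only the first visit to each state in a trajectory, and that "temporal geometric sampling" means drawing a time index with probability proportional to gamma**i. The function names, the `gamma` parameter, and the toy state labels are all hypothetical.

```python
import random


def first_occupancy(trajectory):
    """Keep only the first visit to each state, preserving time order.

    Assumption: states are hashable (e.g. strings or tuples).
    Returns a list of (time_step, state) pairs.
    """
    seen = set()
    kept = []
    for t, state in enumerate(trajectory):
        if state not in seen:
            seen.add(state)
            kept.append((t, state))
    return kept


def sample_geometric_index(length, gamma, rng):
    """Draw an index in [0, length) with probability proportional to gamma**i.

    Assumption: a truncated geometric distribution over positions,
    so earlier positions are favoured when 0 < gamma < 1.
    """
    weights = [gamma ** i for i in range(length)]
    return rng.choices(range(length), weights=weights, k=1)[0]


# Toy trajectory with revisited states (hypothetical labels).
toy_trajectory = ["s0", "s1", "s0", "s2", "s1"]
first_visits = first_occupancy(toy_trajectory)

# Sample one first-visit state, biased toward earlier first visits.
rng = random.Random(0)
idx = sample_geometric_index(len(first_visits), gamma=0.9, rng=rng)
sampled_time, sampled_state = first_visits[idx]
```

Under these assumptions, revisits to a state are discarded before sampling, so each state can be selected only at the time it was first reached.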

Paper Structure

This paper contains 31 sections, 7 equations, 15 figures, 1 table, 3 algorithms.

Figures (15)

  • Figure 1: (a) Example task. (b) The FSM for temporal dependencies of subgoals. Letters "c", "b", "w" and "d" are short for charger, board, wheel and diamond, respectively.
  • Figure 2: Examples of TL formulas and their corresponding FSMs. The initial node is $v_0$ and the accepting (terminal) node is $v_T$.
  • Figure 3: The subgoal tree representing the subgoal temporal dependencies $c;(b\vee w);d$.
  • Figure 4: Diagram of the LSTOC framework. For learning subgoals, $\mathcal{B}_P$ ($\mathcal{B}_N$) represents the buffer of positive (negative) trajectories, $f_{\theta}$ is the state representation function, $\hat{\mathcal{S}}_K$ is the set of discovered key states, ${\mathcal{T}}_{\varphi}$ is the subgoal tree, and $\pi_{\text{exp}}$ is the exploration policy. The trajectory collection is guided by ${\mathcal{T}}_{\varphi}$ and $\pi_{\text{exp}}$. In the labeling part, based on $\mathcal{M}_{\varphi}$, $\hat{\mathcal{S}}_K$ and ${\mathcal{T}}_{\varphi}$, the mapping from discovered key states to subgoal symbols is determined by solving an ILP problem, yielding the labeling function. $\mathcal{M}_{\varphi}$ denotes the FSM of temporal dependencies of subgoals in task $\varphi$.
  • Figure 5: Examples of building the subgoal tree ${\mathcal{T}}_{\varphi}$. The temporal dependencies of subgoals can be expressed as $(a;b)\vee(b;c)$, whose FSM is shown in the rightmost figure. In the left three figures, the red node denotes the working node $v_w$, and $\hat{\mathcal{S}}_K$ is given in the upper-left corner. The dashed nodes are nodes that have not yet been explored. Every node is labeled with a discovered key state and its index. The fourth figure shows a fully built ${\mathcal{T}}_{\varphi}$. The subgoals of the task are hidden, and the agent only knows the result of task completion for each episode.
  • ...and 10 more figures
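
The temporal dependency $c;(b\vee w);d$ from Figures 1 and 3 can be read as "reach c, then either b or w, then d". The following Python sketch encodes that reading as a simple chain-shaped FSM and enumerates the root-to-leaf symbol paths that a subgoal tree for this formula would contain. It is an illustration under assumptions, not the paper's implementation: the `STAGES` encoding and the function names are hypothetical, symbols that do not advance the FSM are assumed to be ignored, and the encoding covers only a sequence of disjunctions, not general formulas such as $(a;b)\vee(b;c)$.

```python
from itertools import product

# Hypothetical encoding of c;(b∨w);d: each stage lists the symbols
# that move the FSM from one node to the next.
STAGES = [{"c"}, {"b", "w"}, {"d"}]


def run_fsm(symbols, stages=STAGES):
    """Return True if the symbol sequence drives the FSM to its accepting node.

    Assumption: a symbol that does not match the current stage is ignored
    (a self-loop), so only the relative order of matching symbols matters.
    """
    node = 0  # index of the current FSM node; len(stages) is accepting
    for sym in symbols:
        if node < len(stages) and sym in stages[node]:
            node += 1
    return node == len(stages)


def accepting_paths(stages=STAGES):
    """Enumerate every symbol path from the initial to the accepting node."""
    return [list(path) for path in product(*[sorted(s) for s in stages])]


paths = accepting_paths()           # [['c', 'b', 'd'], ['c', 'w', 'd']]
ok = run_fsm(["c", "w", "d"])       # True: correct order
bad = run_fsm(["d", "c", "b"])      # False: d arrives before c and b
```

Each list returned by `accepting_paths` corresponds to one branch of the tree for this formula, which is why the tree in Figure 3 would split after c into a b branch and a w branch.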