Identifying Selections for Unsupervised Subtask Discovery

Yiwen Qiu, Yujia Zheng, Kun Zhang

TL;DR

The paper provides a theory to identify, and experiments to verify, the existence of selection variables in expert trajectory data. These selections serve as subgoals that indicate subtasks and guide the policy. A sequential non-negative matrix factorization (seq-NMF) method is developed to learn these subgoals and extract meaningful behavior patterns as subtasks.

Abstract

When solving long-horizon tasks, it is appealing to decompose the high-level task into subtasks. Decomposing experiences into reusable subtasks can improve data efficiency, accelerate policy generalization, and in general provide promising solutions to multi-task reinforcement learning and imitation learning problems. However, the concept of subtasks is not yet sufficiently understood or modeled, and existing works often overlook the true structure of the data generation process: subtasks are the results of a $\textit{selection}$ mechanism on actions, rather than possible underlying confounders or intermediates. Specifically, we provide a theory to identify, and experiments to verify, the existence of selection variables in such data. These selections serve as subgoals that indicate subtasks and guide the policy. In light of this idea, we develop a sequential non-negative matrix factorization (seq-NMF) method to learn these subgoals and extract meaningful behavior patterns as subtasks. Our empirical results on a challenging Kitchen environment demonstrate that the learned subtasks effectively enhance generalization to new tasks in multi-task imitation learning scenarios. The code is available at https://anonymous.4open.science/r/Identifying\_Selections\_for\_Unsupervised\_Subtask\_Discovery/README.md.

Paper Structure

This paper contains 68 sections, 8 theorems, 24 equations, 15 figures, 5 tables, and 3 algorithms.

Key Result

Proposition 1

(Sufficient condition) Assuming that the graphical representation is Markov and faithful to the measured data, if $\mathbf{s_t} \not\!\perp\!\!\!\perp \mathbf{a_t} \mid \mathbf{d_{t}}$, then $\mathbf{d_{t}}$ is a selection variable, i.e., $\mathbf{d_{t}} \coloneqq \mathbf{g_{t}}$, under the assum
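The intuition behind this condition can be illustrated with a toy simulation. The sketch below is not the authors' procedure: it assumes three isolated linear-Gaussian variables per structure (confounder, mediator, selection), with no other edges, and uses partial correlation as a stand-in for a conditional independence test. The function names `simulate` and `partial_corr` are hypothetical. Under these assumptions, only the selection structure leaves the state and action dependent once the third variable is conditioned on.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z out of both."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

def simulate(structure, n=20000, seed=0):
    """Draw (s, a, d) from a toy linear-Gaussian model (an assumption)."""
    rng = np.random.default_rng(seed)
    e1, e2, e3 = rng.standard_normal((3, n))
    if structure == "confounder":   # s <- d -> a
        d = e1
        s = d + e2
        a = d + e3
    elif structure == "mediator":   # s -> d -> a
        s = e1
        d = s + e2
        a = d + e3
    elif structure == "selection":  # s -> d <- a
        s = e1
        a = e2
        d = s + a + e3
    else:
        raise ValueError(structure)
    return s, a, d

for name in ("confounder", "mediator", "selection"):
    s, a, d = simulate(name)
    print(name, round(partial_corr(s, a, d), 3))
```

In this toy setting the partial correlation is close to zero for the confounder and mediator cases and close to -0.5 for the selection case, since conditioning on a common effect induces dependence between its causes. Partial correlation only captures conditional independence for linear-Gaussian data, so a general-purpose test would be needed for real trajectories.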

Figures (15)

  • Figure 1: Example of subgoals as selections. One subgoal is to "go picnicking", another subgoal is to "go to a movie". In order to "go picnicking", you need to go shopping first and then drive to the park; in order to "go to a movie", you need to check the movie information online first and then get the tickets. The actions cause us to accomplish the subtasks, and we essentially select the actions based on (conditioned on) the subgoals we want to achieve. In contrast, weather is a confounder of the states and actions: changing our actions does not influence the weather, whereas our actions do influence whether we achieve the subgoals.
  • Figure 2: Three kinds of dependency patterns of DAGs that we aim to distinguish. Structure (1) models the confounder case $\mathbf{s_t} \leftarrow \mathbf{c_{t}} \rightarrow \mathbf{a_t}$, structure (2) models the selection case $\mathbf{s_t} \rightarrow \mathbf{g_{t}} \leftarrow \mathbf{a_t}$, and structure (3) models the mediator case $\mathbf{s_t} \rightarrow \mathbf{m_{t}} \rightarrow \mathbf{a_t}$. In all three scenarios, the solid black arrows ($\rightarrow$) indicate the transition function that is invariant across different tasks. The dashed arrows ($\dashrightarrow$) indicate dependencies between nodes $\mathbf{d_{t}}$ and $\mathbf{d_{t+1}}$. We take them to be direct adjacencies in the main paper, and for potentially higher-order dependencies, we refer to Appx. \ref{app:relaxation}.
  • Figure 3: Figure (a) is the causal model for expert trajectories, which is further abstracted as the matrices in Figure (b), which can be learned by a seq-NMF algorithm. In both figures, data matrix $X$ is the aggregated $\{\mathbf{s_t}; \mathbf{a_t}\}_{t=1}^T$, and $\mathbf{H}\in\{0, 1\}^{J\times T}$ represents the binary subgoal matrix.
  • Figure 4: Patterns in Color-3 and Color-10.
  • Figure 5: Two tasks in the Driving environment.
  • ...and 10 more figures
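
The matrix view in the Figure 3 caption can be sketched as follows. The source only states that $X$ aggregates $\{\mathbf{s_t}; \mathbf{a_t}\}_{t=1}^T$ and that $\mathbf{H}\in\{0, 1\}^{J\times T}$ is binary. The convolutive reconstruction $\hat{X} = \sum_{l} W_l \, \mathrm{shift}(H, l)$, the pattern tensor `W`, the lag count `L`, and all sizes below are assumptions borrowed from standard seqNMF-style models, not details confirmed by the paper. The sketch shows only the forward (reconstruction) step, not the learning of `W` and `H`.

```python
import numpy as np

def shift_right(H, l):
    """Shift the columns of H right by l steps, zero-padding on the left."""
    if l == 0:
        return H.copy()
    out = np.zeros_like(H)
    out[:, l:] = H[:, :-l]
    return out

def reconstruct(W, H):
    """Convolutive reconstruction: sum over lags l of W[:, :, l] @ shift_right(H, l).

    W: (D, J, L) non-negative temporal patterns, one per subgoal (assumed form).
    H: (J, T) binary matrix marking when each subgoal is active.
    """
    D, J, L = W.shape
    X_hat = np.zeros((D, H.shape[1]))
    for l in range(L):
        X_hat += W[:, :, l] @ shift_right(H, l)
    return X_hat

# Hypothetical toy sizes: D=3 features, J=2 subgoals, L=2 lags, T=6 steps.
W = np.zeros((3, 2, 2))
W[0, 0, 0] = 1.0  # pattern 0: feature 0 at lag 0 ...
W[1, 0, 1] = 1.0  # ... followed by feature 1 at lag 1
W[2, 1, 0] = 2.0  # pattern 1: feature 2 for a single step
H = np.zeros((2, 6))
H[0, 0] = 1  # subgoal 0 switched on at t=0
H[1, 3] = 1  # subgoal 1 switched on at t=3
X_hat = reconstruct(W, H)
print(X_hat)
```

Each active entry of `H` stamps a copy of the corresponding temporal pattern into the reconstruction, which is why a sparse binary `H` can be read as a timeline of subgoals.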

Theorems & Definitions (15)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Remark 1
  • Remark 2
  • Definition 5
  • ...and 5 more