Granger Causal Interaction Skill Chains

Caleb Chuck, Kevin Black, Aditya Arjun, Yuke Zhu, Scott Niekum

TL;DR

This paper introduces the Chain of Interaction Skills (COInS) algorithm, which uses controllability in factored domains to identify a small number of task-agnostic skills that still permit a high degree of control.

Abstract

Reinforcement Learning (RL) has demonstrated promising results in learning policies for complex tasks, but it often suffers from low sample efficiency and limited transferability. Hierarchical RL (HRL) methods aim to address the difficulty of learning long-horizon tasks by decomposing policies into skills, abstracting states, and reusing skills in new tasks. However, many HRL methods require some initial task success to discover useful skills, which paradoxically may be very unlikely without access to useful skills. On the other hand, reward-free HRL methods often need to learn far too many skills to achieve proper coverage in high-dimensional domains. In contrast, we introduce the Chain of Interaction Skills (COInS) algorithm, which focuses on controllability in factored domains to identify a small number of task-agnostic skills that still permit a high degree of control. COInS uses learned detectors to identify interactions between state factors and then trains a chain of skills to control each of these factors successively. We evaluate COInS on a robotic pushing task with obstacles -- a challenging domain where other RL and HRL methods fall short. We also demonstrate the transferability of skills learned by COInS, using variants of Breakout, a common RL benchmark, and show 2-3x improvement in both sample efficiency and final performance compared to standard RL baselines.
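The chaining idea can be illustrated with a toy sketch. It assumes, following the Figure 1 caption, that the goal space of one skill is the action space of the next skill in the chain, so a goal for the last factor is translated link by link into a primitive action. The `Skill` class, the `descend` function, the string-encoded goals, and the hand-written policies are all hypothetical stand-ins for illustration; they are not the paper's learned policies or interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Skill:
    """One link in the chain: controls `factor` by choosing goals for its parent."""
    factor: str
    # Hypothetical policy signature: (state, goal for this factor) -> goal for the parent.
    policy: Callable[[dict, str], str]


def descend(chain: List[Skill], state: dict, top_goal: str) -> List[str]:
    """Translate a goal for the last factor into a primitive action.

    `chain` is ordered from the factor closest to the primitive actions to the
    factor furthest from them. Each skill's output becomes the goal of the
    skill below it; the final output is a primitive action.
    """
    goals = [top_goal]
    for skill in reversed(chain):
        goals.append(skill.policy(state, goals[-1]))
    return goals


def ball_policy(state: dict, goal: str) -> str:
    # Toy rule (assumption): to bounce the ball, ask for the paddle under the ball.
    return f"paddle_x={state['ball_x']}"


def paddle_policy(state: dict, goal: str) -> str:
    # Toy rule (assumption): move the paddle toward the requested x position.
    target_x = int(goal.split("=")[1])
    return "left" if state["paddle_x"] > target_x else "right"


chain = [Skill("paddle", paddle_policy), Skill("ball", ball_policy)]
trace = descend(chain, {"paddle_x": 5, "ball_x": 2}, "bounce")
print(trace)  # ['bounce', 'paddle_x=2', 'left']
```

In this sketch a policy higher in the chain only ever chooses among goals for the factor directly below it, which is one way to read the claim that a small number of skills can still permit a high degree of control.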

Paper Structure (39 sections, 13 equations, 9 figures, 6 tables, 1 algorithm)

Figures (9)

  • Figure 1: Left: The chain of COInS goal-based skills for Breakout, from primitive actions to final reward optimization. The goal space of the skill for one factor is the action space of the next factor-controlling skill in the chain. COInS uses Granger-causal tests to detect interactions and construct edges between pairs of factors and their corresponding skills. Right: The Robot Pushing domain with negative reward regions. The objective is to push the block (red) from a random start position to a random goal (green) while avoiding the negative regions (shown in yellow, $-2$ reward), which are generated randomly in 15 grid spaces of a $5 \times 5$ grid over the workspace.
  • Figure 2: Left: An illustration of the active and passive model inputs and outputs for the paddle-ball Granger models. Right: Three possible-interaction states. In Case 1, COInS predicts no interaction because both the passive and active models predict accurately. In Case 2, the active model predicts accurately using paddle information, but the passive model does not, indicating a paddle-ball interaction. Case 3 is not a paddle-ball interaction since both the passive and active models predict poorly.
  • Figure 3: Training performance of COInS (blue) against baselines (see legend) on Breakout (left) and Robot Pushing with negative reward regions (right). Each algorithm is evaluated over 10 trials, with the shaded region representing standard deviation. The vertical pink lines indicate when COInS starts learning a high-reward policy. In the pushing domain, the return for not moving the block is $-30$. Most algorithms do not touch the block. The minimum total reward is $-600$, obtained by spending every time step in a negative region; R-HyPE's performance falls within this lowest bracket. The final evaluation of COInS against the baselines after $2M$ steps is in Appendix Table \ref{FullPerformanceTable}.
  • Figure 4: Skill transfer with COInS (blue), training from scratch (orange), pre-train and fine-tune to the variant (green), and HyPE skills (C-HyPE in cyan, R-HyPE in purple). C-HyPE has nothing to optimize on transfer (it only learned a single "bounce" action), so average performance is visualized. Below the performance curves, the variants are visualized with the red (light) blocks as negative reward/unbreakable blocks, and the blue (dark) positive reward blocks. Details in Appendix \ref{BaselineDetails} and Section \ref{transfer}.
  • Figure 5: (upper) An illustration of the COInS hierarchy. At each edge, a policy controls the target factor to reach an interaction goal, and takes interaction goals as temporally extended actions. The child of the primitive actions passes primitive actions as pseudogoals. (lower) Learning a single layer of the COInS hierarchy: the pairwise interaction between the paddle and the ball. The interaction detector and the ball skill policy both take the paddle (parent) and ball (target) states as inputs; the detector indicates whether a goal has been reached, and the policy indicates whether a new temporally extended action is needed. The ball skill policy uses paddle goals as actions and receives ball bounces as goals from the reward-optimizing policy.
  • ...and 4 more figures
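
The three cases in the Figure 2 caption can be written as a small decision rule. The sketch below is only an illustration of that caption: it assumes each model's quality on a transition is summarized by a scalar prediction error and that "predicts accurately" means the error falls below a fixed threshold. The function name `classify_transition`, the error inputs, and the `threshold` value are hypothetical and are not taken from the paper.

```python
def classify_transition(passive_error: float, active_error: float,
                        threshold: float = 0.1) -> str:
    """Label one transition using the passive/active comparison of Figure 2.

    passive_error: error of the model that predicts the target (ball) from
        the target's own state only.
    active_error: error of the model that also sees the parent (paddle) state.
    threshold: assumed cutoff separating accurate from poor predictions.
    """
    passive_ok = passive_error <= threshold
    active_ok = active_error <= threshold

    if passive_ok and active_ok:
        # Case 1: the ball's own state already explains the transition.
        return "no interaction"
    if active_ok and not passive_ok:
        # Case 2: only the model with paddle information predicts well.
        return "interaction"
    if not passive_ok and not active_ok:
        # Case 3: paddle information does not help, so this is not a
        # paddle-ball interaction.
        return "no interaction (unexplained)"
    # Passive accurate but active poor: not described in the caption.
    return "not covered"


print(classify_transition(0.02, 0.01))  # no interaction
print(classify_transition(0.80, 0.03))  # interaction
print(classify_transition(0.90, 0.85))  # no interaction (unexplained)
```

Only the second case flags an interaction: the parent's state has to make the difference between a poor and an accurate prediction of the target, which is the Granger-causal intuition the caption describes.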