Reinforcement Learning via Auxiliary Task Distillation

Abhinav Narayan Harish, Larry Heck, Josiah P. Hanna, Zsolt Kira, Andrew Szot

TL;DR

AuxDistill can learn a pixels-to-actions policy for a challenging multi-stage embodied object rearrangement task from the environment reward, without demonstrations, a learning curriculum, or pre-trained skills.

Abstract

We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method that enables reinforcement learning (RL) to solve long-horizon robot control problems by distilling behaviors from auxiliary RL tasks. AuxDistill achieves this by concurrently carrying out multi-task RL with auxiliary tasks, which are easier to learn and relevant to the main task. A weighted distillation loss transfers behaviors from these auxiliary tasks to solve the main task. We demonstrate that AuxDistill can learn a pixels-to-actions policy for a challenging multi-stage embodied object rearrangement task from the environment reward without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves $2.3\times$ higher success than the previous state-of-the-art baseline in the Habitat Object Rearrangement benchmark and outperforms methods that use pre-trained skills and expert demonstrations.

Paper Structure

This paper contains 28 sections, 5 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: AuxDistill learns a rearrangement policy that operates from egocentric depth perception and a coordinate-based task specification. The full object rearrangement task decomposes into modular abilities that can be learned by auxiliary tasks with indicator vectors $T_1, \ldots, T_N$, which are trained along with the main task using end-to-end RL. The distillation loss for an observation $o_t$ in the main task $T_0$ is a combination over all auxiliary tasks, weighted by task relevance. The task relevance function computes $w_i(s_t)$ based on the relevance of the current observation and robot state to the auxiliary task $T_i$. The distillation loss and the RL training loss are then used to update the policy.
  • Figure 2: Comparison of skill robustness to different choices of auxiliary tasks on the hard distribution. Including both Pick and Pick from Fridge is crucial for successful rearrangement on this distribution. Not utilizing Open-Fridge leads to a boost in rearrangement success. This improvement arises because Open-Fridge is the easiest of all auxiliary tasks, and utilizing it reduces the number of samples available for the main task (from the $100$M step budget). We discuss the auxiliary task learning curves in Appendix \ref{sec:aux-task-learn}.
  • Figure 3: Left: RL success rates on training episodes of the rearrangement task from Table \ref{tab:rearrange}. Note that AuxDistill (No Distill) and RL-Curriculum are displayed but achieve $0\%$ success throughout training. Right: Learning curve on the Category Pick task of AuxDistill utilizing coordinate pick distillation vs. monolithic RL. AuxDistill outperforms baselines in both settings. Displayed are averages and standard deviations over 3 random seeds.
  • Figure 4: Success curves of the individual skills for the main experiment reported in Table \ref{tab:rearrange}. The Open-Fridge Pick and Place skill shows high variance across seeds for the main method. RL-Curriculum shows higher variance on the Pick skill. Results are reported up to $250$M training steps to show a comparison with all baselines (RL-Curriculum trains sub-skills only for the first stage).
  • Figure 5: Analyzing the curriculum of behaviors that emerges while training AuxDistill. Left: learning of the easier tasks compared with the harder tasks; the easier task learns first, followed by the harder task. Right: learning of the main task ($\mathcal{M}_{0}$) compared with the relevant auxiliary tasks; the easier distribution improves only after the relevant auxiliary skills, i.e., Pick and Place, begin learning.
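
The weighted distillation loss described in the abstract and in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a discrete action space and takes the loss to be a relevance-weighted sum of KL divergences from each auxiliary-task action distribution to the main-task action distribution. The text above does not specify the divergence or its direction, and all names (`weighted_distillation_loss`, `main_logits`, `aux_logits`, `weights`) are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def weighted_distillation_loss(main_logits, aux_logits, weights):
    """Relevance-weighted distillation loss for a single observation o_t.

    main_logits: action logits of the policy conditioned on the main task T_0.
    aux_logits:  one list of action logits per auxiliary task T_1..T_N,
                 computed on the same observation (assumed layout).
    weights:     task relevance values w_i(s_t), one per auxiliary task.
    """
    if len(aux_logits) != len(weights):
        raise ValueError("need exactly one weight per auxiliary task")
    main_probs = softmax(main_logits)
    loss = 0.0
    for logits_i, w_i in zip(aux_logits, weights):
        if w_i == 0.0:
            continue  # irrelevant auxiliary tasks contribute nothing
        # Assumed direction: auxiliary policy acts as the teacher.
        loss += w_i * kl_divergence(softmax(logits_i), main_probs)
    return loss


if __name__ == "__main__":
    main = [0.2, 1.5, -0.3]
    aux = [[0.2, 1.5, -0.3], [2.0, 0.0, 0.0]]
    # Only the second auxiliary task is treated as relevant here.
    print(weighted_distillation_loss(main, aux, weights=[0.0, 1.0]))
```

In a training loop this term would be added to the RL loss before the policy update, as the Figure 1 caption describes. In this sketch, setting a weight to zero removes that auxiliary task's influence on the main-task policy for the given state.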