Multi-Horizon Representations with Hierarchical Forward Models for Reinforcement Learning

Trevor McInroe, Lukas Schäfer, Stefano V. Albrecht

TL;DR

HKSL introduces a hierarchical, multi-timescale latent forward-model framework for reinforcement learning from pixels, using a stack of forward models with varying step skips and a communication mechanism between levels, together with an ensemble of $n$-step critics. This design yields representations that capture task-relevant dynamics across timescales and improves sample efficiency, outperforming strong baselines on 30 DMControl tasks with and without distractors as well as a custom Falling Pixels task. The work demonstrates that hierarchical latent predictions and cross-level information sharing can organize environment information effectively, enabling faster learning and better robustness; however, it incurs additional computational cost and raises questions about automatic hierarchy tuning. Future directions include dynamic hierarchy adjustment and applying HKSL concepts to broader model-based RL, exploration, and planning tasks.

Abstract

Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions, which may cause learning inefficiencies if important environmental changes take many steps to manifest. We propose Hierarchical $k$-Step Latent (HKSL), an auxiliary task that learns multiple representations via a hierarchy of forward models that learn to communicate and an ensemble of $n$-step critics that all operate at varying magnitudes of step skipping. We evaluate HKSL in a suite of 30 robotic control tasks with and without distractors and a task of our creation. We find that HKSL either converges to higher episodic returns or reaches optimal returns more quickly than several alternative representation learning approaches. Furthermore, we find that HKSL's representations capture task-relevant details accurately across timescales (even in the presence of distractors) and that communication channels between hierarchy levels organize information based on both sides of the communication process, both of which improve sample efficiency.
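
To make the idea of critics operating at different step skips concrete, the following is a minimal sketch of the standard $n$-step bootstrapped return target, computed once per hierarchy level. It is not the paper's exact critic loss, which is not given in this section. The names `n_step_target`, `ensemble_targets`, `values`, and `skips` are hypothetical and introduced only for illustration.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Standard n-step return: sum_i gamma^i * r_i + gamma^n * bootstrap_value.

    `rewards` holds the n rewards collected along the trajectory segment and
    `bootstrap_value` is a value estimate for the state reached after n steps.
    """
    target = bootstrap_value
    # Fold from the last reward backwards so each step applies one discount.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


def ensemble_targets(rewards, values, skips, gamma=0.99):
    """One target per level; level l bootstraps skips[l] steps ahead of t=0.

    Assumption: values[t] is a (hypothetical) critic estimate for the state
    at time t, so len(values) must exceed max(skips).
    """
    return [n_step_target(rewards[:n], values[n], gamma) for n in skips]


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    values = [0.0, 0.0, 0.0, 10.0]
    # Two levels, mirroring the n^1 = 1 and n^2 = 3 example in Figure 1.
    print(ensemble_targets(rewards, values, skips=[1, 3], gamma=0.5))
```

With a skip of 1 the target relies almost entirely on the next-state estimate, while a skip of 3 folds in three observed rewards before bootstrapping, which is one way longer-horizon information can enter the learning signal.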

Paper Structure

This paper contains 20 sections, 8 equations, 16 figures, 3 tables, 1 algorithm.

Figures (16)

  • Figure 1: Depiction of HKSL architecture with an "unrolled" two-level hierarchical model where the first level moves at one step $n^{1}=1$ and the second level moves at three steps $n^{2}=3$. First, the online encoders $e_o$ (blue) encode the initial observation $o_1$ of the sampled trajectory. Next, the forward models $f$ (red) predict the latent representations of the following observations, with level 1 predicting single steps ahead conditioned on the level's previous representation and applied action. The forward model of the second level predicts three steps ahead and receives the previous representation and concatenation of the three applied actions. The communication manager $c$ (green) forwards information from the representations of the coarser second level to each forward model step of the first level as additional inputs. All models are trained end-to-end with a normalized $\ell_2$ loss of the difference between the projected representations of each level and timestep and the target representations of observations at the predicted timesteps. Target representations are obtained using momentum encoders $e_m$ (purple), and projections are done by the projection model $w$ (yellow) of the given level.
  • Figure 2: IQM (left) and optimality gap (middle) of evaluation returns at 100k environment steps, and IQM throughout training (right) across all 30 DMControl tasks. Shaded areas are 95% confidence intervals.
  • Figure 3: IQM and 95% CIs of evaluation returns for all algorithms in Falling Pixels (left) and ablations over HKSL's $h$ (right).
  • Figure 4: IQM and 95% CIs of evaluation returns for HKSL ablations in Cartpole, Swingup (left), Ball in Cup, Catch (middle), and Walker, Walk (right).
  • Figure 5: MSE on task-relevant information in unseen episodes for Cartpole, Swingup (top) and Ball in Cup, Catch (bottom) at the 100k environment steps mark. Non-distraction, color distractor, and camera distractor settings shown from left-to-right. Lower is better.
  • ...and 11 more figures
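
Several captions above report IQM and optimality gap. As a reading aid, here is a minimal sketch of how these two aggregate metrics are commonly computed from normalized per-run scores. It assumes the usual definitions (IQM as the mean of the middle 50% of scores, optimality gap as the average shortfall below a threshold of 1.0). The paper's exact evaluation code is not shown in this section, and the function names are hypothetical.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of scores (bottom and top 25% discarded)."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # number of runs trimmed from each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def optimality_gap(scores, threshold=1.0):
    """Average amount by which scores fall short of `threshold`.

    Scores at or above the threshold contribute zero, so lower is better.
    """
    return sum(max(threshold - s, 0.0) for s in scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical normalized returns from eight runs, with one outlier.
    runs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 5.0]
    print(interquartile_mean(runs))
    print(optimality_gap(runs))
```

Trimming both tails makes IQM less sensitive to single outlier runs than a plain mean, which is why the example includes one unusually high score.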