Exploiting Hierarchy for Learning and Transfer in KL-regularized RL
Dhruva Tirumala, Hyeonwoo Noh, Alexandre Galashov, Leonard Hasenclever, Arun Ahuja, Greg Wayne, Razvan Pascanu, Yee Whye Teh, Nicolas Heess
TL;DR
The paper develops a KL-regularized RL framework that introduces latent-variable hierarchies into both the policy and a learned default policy, allowing richer inductive biases and modular transfer across tasks and bodies. The authors derive a tractable bound that decomposes the KL regularizer into high-level and low-level terms, train with reparameterized high-level latents and off-policy learning using Retrace, and use information asymmetry and partial parameter sharing to control the regularization. Empirically, hierarchical policies with latent-variable default policies learn faster and transfer better than flat KL-regularized methods and DISTRAL baselines on continuous-control and grid-world tasks, including transfer across tasks and across bodies. More broadly, the approach frames RL as hierarchical probabilistic trajectory modelling and provides practical algorithms for scalable off-policy training.
Abstract
As reinforcement learning agents are tasked with solving more challenging and diverse tasks, the ability to incorporate prior knowledge into the learning system and to exploit reusable structure in solution space is likely to become increasingly important. The KL-regularized expected reward objective constitutes one possible tool to this end. It introduces an additional component, a default or prior behavior, which can be learned alongside the policy and as such partially transforms the reinforcement learning problem into one of behavior modelling. In this work we consider the implications of this framework in cases where both the policy and default behavior are augmented with latent variables. We discuss how the resulting hierarchical structures can be used to implement different inductive biases and how their modularity can benefit transfer. Empirically we find that they can lead to faster learning and transfer on a range of continuous control tasks.
