MRSD: Multi-Resolution Skill Discovery for HRL Agents

Shashank Sharma, Janina Hoffmann, Vinay Namboodiri

TL;DR

MRSD tackles long-horizon control by learning multiple skill encoders at temporal scales $l_i$ together with a policy that dynamically interleaves them, enabling both fine- and coarse-grained control. It uses conditional VAE (CVAE) skill encodings to model abstract state transitions, and an exploratory objective based on reconstruction error to drive diverse skill discovery without external rewards. Empirical results on the DeepMind Control Suite show faster convergence and competitive final performance relative to state-of-the-art skill discovery and HRL baselines, with performance approaching non-HRL methods like DreamerV2 on several tasks. The approach offers a scalable path to versatile agents that combine multi-resolution skills for more efficient and flexible control, with well-characterized limitations and clear directions for future work.

Abstract

Hierarchical reinforcement learning (HRL) relies on abstract skills to solve long-horizon tasks efficiently. While existing skill discovery methods learn these skills automatically, they are limited to a single skill per task. In contrast, humans learn and use both fine-grained and coarse motor skills simultaneously. Inspired by human motor control, we propose Multi-Resolution Skill Discovery (MRSD), an HRL framework that learns multiple skill encoders at different temporal resolutions in parallel. A high-level manager dynamically selects among these skills, enabling adaptive control strategies over time. We evaluate MRSD on tasks from the DeepMind Control Suite and show that it outperforms prior state-of-the-art skill discovery and HRL methods, achieving faster convergence and higher final performance. Our findings highlight the benefits of integrating multi-resolution skills in HRL, paving the way for more versatile and efficient agents.

Paper Structure

This paper contains 28 sections, 10 equations, 18 figures, 1 table, 2 algorithms.

Figures (18)

  • Figure 1: Simulation of a simple point agent (star) in a 2D grid that moves towards assigned goal positions (crosses). The goal is updated every $K$ steps and alternates between $(x + l_i,1)$ and $(x+l_i,-1)$, where $x$ is the agent's current x-position and $l_i \in \{1,2,4,8\}$ is the skill length. Goal positions affect agent behavior according to their distance from the agent's state. Closer goals lead to more controlled and precise movements, but make the agent more susceptible to incorrect goals. Faraway goals cause less deviation, leading to smooth but imprecise movements.
  • Figure 2: Illustrations of the abstract state transition-based control for the manager. Dashed arrows indicate sample propagation from the predicted distribution. (a) Skill CVAE, where the Encoder encodes initial and final states $(s_t,s_{t+l})$ to a latent skill space and the Decoder reconstructs the final state using the initial state $s_t$ and a sampled skill variable. (b) The manager predicts the latent skills and then uses the Decoder to generate goals for the worker.
  • Figure 3: Architectures for learning and acting using Multi-Resolution Skills ($l_i \in \{l_0,l_1,...,l_N\}$). Dashed arrows indicate sample propagation from the predicted distribution. Dashed boundaries indicate shared layers. (a) Separate CVAEs are learnt for each temporal resolution $l_i$. The $\text{Enc}$ and $\text{Dec}$ modules represent the common layers of the Encoders and the Decoders, respectively. Each $\text{Enc}_i$ is the resolution-specific encoder output layer, and each $\text{Dec}_i$ is the resolution-specific decoder input layer. (b) The manager's policy has $N+1$ output heads: $N$ skill heads $\pi_{M_i}$ that predict the resolution-specific skill latents, and a choice head $\pi_{M_C}$ that predicts an $N$-dimensional one-hot distribution. Samples from the skill latents are used to predict sub-goals using the respective Decoders, then the choice sample from $\pi_{M_C}$ selects one of the sub-goals as $s_g$ by gating.
  • Figure 4: Episode scores from MRSD (ours) and the Director ($3$ seeds per experiment). The plot shows the total rewards (mean and standard deviation) received in an episode against the environmental step. Both methods use the same common hyperparameters.
  • Figure 5: Stream graphs showing the evolution of the choice distribution during training, averaged across $3$ seeds. The manager tends to start with the $\infty$ skills and then gradually switches to the temporally constrained skills.
  • ...and 13 more figures
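
The point-agent setup described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative reconstruction under stated assumptions, not the authors' simulation code: the caption does not say how the agent moves, so the straight-line motion rule, the `step_size` cap, and the helper names `simulate` and `vertical_travel` are all hypothetical. Only the goal rule (re-anchored every $K$ steps at $(x + l_i, \pm 1)$, alternating sides) comes from the caption.

```python
import math


def simulate(skill_length, goal_period, num_steps, step_size=0.5):
    """Roll out a point agent chasing goals placed skill_length ahead in x.

    step_size (max distance moved per step) and the straight-line motion
    rule are illustrative assumptions, not taken from the paper.
    """
    x, y = 0.0, 0.0
    sign = 1  # goal alternates between y = +1 and y = -1
    goal = (x + skill_length, float(sign))
    path = [(x, y)]
    for t in range(num_steps):
        # Every goal_period steps (K in the caption), flip the goal side and
        # re-anchor it skill_length ahead of the agent's current x-position.
        if t > 0 and t % goal_period == 0:
            sign = -sign
            goal = (x + skill_length, float(sign))
        dx, dy = goal[0] - x, goal[1] - y
        dist = math.hypot(dx, dy)
        if dist > 0.0:
            # Move straight toward the goal without overshooting it.
            scale = min(step_size, dist) / dist
            x, y = x + scale * dx, y + scale * dy
        path.append((x, y))
    return path


def vertical_travel(path):
    """Total absolute change in y along a path: a crude measure of deviation."""
    return sum(abs(b[1] - a[1]) for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    for l_i in (1, 2, 4, 8):
        path = simulate(skill_length=l_i, goal_period=4, num_steps=32)
        print(l_i, round(vertical_travel(path), 3))
```

Under these assumed dynamics, the path for $l_i = 8$ accumulates far less vertical travel than the path for $l_i = 1$, which is in line with the caption's remark that faraway goals cause less deviation. The sketch does not model incorrect goals, so it says nothing about the precision side of the trade-off.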