Exploring the limits of Hierarchical World Models in Reinforcement Learning

Robin Schiewer; Anand Subramoney; Laurenz Wiskott

TL;DR

The paper tackles sample-efficient planning in complex tasks by introducing a two-level hierarchical world-model framework (HMBRL) where each level maintains an RSSM-based model and hosts two agents: a reward-maximiser and a goal-seeker. Temporal abstraction is fixed and static, enabling concurrent training across levels while linking levels via abstract goals; abstract actions are produced through a beta-VAE to curb exploration in high-dimensional spaces. Experiments across Nav2d, PointMaze, Reacher, and HalfCheetah show that the hierarchy learns meaningful abstractions and can match a non-hierarchical baseline in some tasks, though performance is limited by model exploitation at the abstract level. The work highlights model exploitation as a key bottleneck and proposes directions such as discrete abstract actions and variable-length chunking to enhance grounding, stability, and generalisation in future HMBRL systems.

Abstract

Hierarchical model-based reinforcement learning (HMBRL) aims to combine the better sample efficiency of model-based reinforcement learning (MBRL) with the abstraction capability of hierarchical reinforcement learning (HRL) to solve complex tasks efficiently. While HMBRL has great potential, it still lacks wide adoption. In this work we describe a novel HMBRL framework and evaluate it thoroughly. To complement the multi-layered decision-making idiom characteristic of HRL, we construct hierarchical world models that simulate environment dynamics at various levels of temporal abstraction. These models are used to train a stack of agents that communicate in a top-down manner by proposing goals to their subordinate agents. A significant focus of this study is the exploration of a static and environment-agnostic temporal abstraction, which allows concurrent training of models and agents throughout the hierarchy. Unlike most goal-conditioned H(MB)RL approaches, it also leads to comparatively low-dimensional abstract actions. Although our HMBRL approach did not outperform traditional methods in terms of final episode returns, it successfully facilitated decision making across two levels of abstraction using compact, low-dimensional abstract actions. A central challenge in enhancing our method's performance, as uncovered through comprehensive experimentation, is model exploitation on the abstract level of our world model stack. We provide an in-depth examination of this issue, discussing its implications for the field and suggesting directions for future research to overcome this challenge. By sharing these findings, we aim to contribute to the broader discourse on refining HMBRL methodologies and to assist in the development of more effective autonomous learning systems for complex decision-making environments.

Paper Structure (19 sections, 17 equations, 14 figures)

This paper contains 19 sections, 17 equations, 14 figures.

Introduction
Methods
Problem Setup
World Model
Temporal Abstraction
Goal and Similarity Measure
Agent Training
GSA Training
Model Exploitation
Experiments and Discussion
Test Environments
World Model Accuracy
Goal Representation and Similarity Measure
Agent Performance
Model Exploitation Experiments
...and 4 more sections

Figures (14)

Figure 1: RSSM rollout graph for three time steps and one step of initial padding. Time steps are grouped by grey boxes and indicated at the bottom for clarity. Squares indicate deterministic and circles indicate stochastic variables, $\oplus$ denotes concatenation and the shaded 0th time step represents initial zero padding. The RSSM state $s_t$ is shown as a deterministic variable to emphasize that Equations \ref{eqn:rssm_state_cl} and \ref{eqn:rssm_state_ol} are deterministic operations once $z_t$ has been sampled. Ground truth information $x_t$ enters the RSSM via the encoder (green arrow), which implies $z_t^{post}$ is used for the next RSSM state. $x_t$ refers to the observation $o_t$ for the level zero RSSM and to the goal $g_t$ otherwise. If no ground truth information is available, $z_t^{prior}$ can fill the role of $z_t^{post}$ as shown in time step 3. During model training, $x_t$, $r_t$ and $d_t$ are reconstructed to $\hat{x}_t$, $\hat{r}_t$ and $\hat{d}_t$ via their decoders (blue arrows). To provide a strong training signal for the world model, they are compared to the ground truth $x_t$, $r_t$ and $d_t$. The third step is open loop, which means $\hat{x}_t$, $\hat{r}_t$ and $\hat{d}_t$ are reconstructed from $z_t^{prior}$.
Figure 2: An overview of our method using a 3-level model with temporal abstractions of 3 steps on level 1 and 2 steps on level 2. The level 0 model does not perform any temporal abstraction and simulates the ground truth data from the environment at full temporal resolution. Each node represents a time step on the respective level. In this example, as shown by the blue nodes, there are 6 steps of ground truth data available. Green nodes on the model levels indicate closed-loop time steps, meaning the model has been grounded via data from the real environment for these steps. The bottom-up information flow for model grounding is visualised by the blue arrows. Yellow nodes indicate open-loop steps that are not grounded. On level 2, the RMA action (red arrow) generated a new open-loop model state that spawned a goal, marked with the letter "g" on level 1. By navigating to this goal, the level 1 GSA executed two actions, symbolised by the yellow arrows, and produced two goals for the level 0 GSA. The GSA on level 0 now has two goals and a budget of 3 actions per goal to navigate towards them. Its immediate next action, which will also be executed in the real environment to collect data for time step 7, is indicated by the dotted yellow arrow. As soon as the level 0 GSA has depleted its total action budget, the data from the 6 newly collected environment steps will be used to ground model levels 0 to 2 and equip the RMA on level 2 with up-to-date information for choosing the next action.
Figure 3: Trajectories from a hierarchical world model with three levels. Each time step's data (observation, reward, terminal flag, model state, etc.) is represented by a square. The shaded areas indicate how many real-world time steps the individual model levels span. Level 0 (red) is a special case that models the true MDP's dynamics with a temporal stride of $k^0=1$. Levels 1 (green) and 2 (blue) have a temporal stride of $k^1=4$ and $k^2=2$, respectively. The reward abstraction process for the 4th step on level 1 (more saturated colouring) is exemplified in detail on the right. The reward on the upper level is the mean of the lower-level chunk ranging from steps 13 to 16, inclusive.
Figure 4: Left: The simple "Nav2d" navigation environment without obstacles or momentum is based on nav2d2019github. The agent (small green circle) can move in continuous steps across the area enclosed by the black rectangle. Actions equal the displacement in x and y direction, with a maximum step width limited to approximately three times the agent diameter. The agent receives a negative reward of -0.01 per step until it reaches the terminal region (large red circle), which ends the episode. Alternatively, the episode ends after a predefined number of steps, chosen to be 50. Right: The more complex "PointMaze" navigation environment features obstacles, momentum and sparse rewards. The agent (green ball) can roll around the environment in a continuous manner and is blocked by walls (orange). The actions equal the forces applied in x and y direction. The agent receives no reward unless it touches the red ball, in which case it receives a reward of 1. This environment does not terminate once the agent and the red ball come into contact, but only after a predefined number of 200 environment interactions. The sooner the agent reaches the red ball, the more reward it can therefore collect.
Figure 5: Left: The "Reacher" environment emulates a simple robotics setting where a two-jointed robot arm's end effector (green ball) has to be moved to a given position (red ball). Environment observations comprise the joint angles, angular velocities and the distance to the red ball. The arm can only be moved in two dimensions by applying torques to its individual joints. The reward is the sum of the Euclidean distance between the end effector and the target position and a control penalty that equals the norm of the action vector. Right: The "HalfCheetah" environment represents a high-dimensional locomotion task with complex dynamics. Observations contain positions, velocities and angular velocities of the cheetah's joints. The task is to make the cheetah run as fast as possible to the right by applying torques to the hip, knee and ankle joints. The reward depends on the running velocity minus a penalty for control cost that depends on the magnitude of the applied torques.
...and 9 more figures
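
The stride arithmetic in the captions of Figures 2 and 3 can be checked with a small sketch. It is not the authors' implementation: the function names `abstract_rewards` and `real_steps_per_level` are hypothetical, the toy reward values are made up for illustration, and two behaviours are assumptions not confirmed by the captions (dropping an incomplete trailing chunk, and applying the same mean-pooling from level 1 to level 2).

```python
import numpy as np


def abstract_rewards(rewards, stride):
    """Mean-pool a lower-level reward sequence into chunks of `stride` steps.

    Assumption: an incomplete trailing chunk is dropped.
    """
    rewards = np.asarray(rewards, dtype=float)
    n_chunks = len(rewards) // stride
    return rewards[: n_chunks * stride].reshape(n_chunks, stride).mean(axis=1)


def real_steps_per_level(strides):
    """Real environment steps spanned by one step on each level.

    This is the cumulative product of the per-level temporal strides.
    """
    return np.cumprod(strides).tolist()


# Strides from Figure 3: k^0 = 1, k^1 = 4, k^2 = 2.
strides_fig3 = [1, 4, 2]

# Toy level 0 rewards for steps 1..16 (illustrative values only).
r0 = np.arange(1, 17, dtype=float)

# Level 1 rewards: each entry is the mean of 4 consecutive level 0 rewards.
r1 = abstract_rewards(r0, strides_fig3[1])

# Level 2 rewards, assuming the same pooling rule is applied one level up.
r2 = abstract_rewards(r1, strides_fig3[2])

print(real_steps_per_level(strides_fig3))  # spans for Figure 3
print(real_steps_per_level([1, 3, 2]))     # spans for the Figure 2 example
print(r1[3])                               # 4th level 1 step: mean of steps 13..16
```

With the Figure 2 strides (3 on level 1, 2 on level 2), one level 2 step spans 3 × 2 = 6 environment steps, which matches the 6 ground truth steps and the level 0 budget of two goals with 3 actions each described in that caption.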
