Hierarchical Subspaces of Policies for Continual Offline Reinforcement Learning

Anthony Kobanda, Rémy Portelas, Odalric-Ambrym Maillard, Ludovic Denoyer

TL;DR

This work tackles offline continual reinforcement learning (CRL) for navigation by introducing HiSPO, a hierarchical framework that maintains two separate policy subspaces, one for high-level planning and one for low-level control. HiSPO uses anchor-based subspaces with pruning and extension, learned in an offline setting from expert data, and employs Dirichlet-based exploration of the subspace with a PAC-style criterion for zero-shot transfer. In experiments on MuJoCo maze tasks and Godot-based video-game-like environments, HiSPO achieves strong performance with significantly lower memory usage than full-expansion baselines, while mitigating forgetting and preserving generalization across task streams. The combination of hierarchical imitation learning and subspace-based adaptation provides a scalable, memory-efficient approach to offline CRL in dynamic navigation settings, with ablations demonstrating the value of the two-subspace architecture and potential extensions via LoRA and PAC-based decisions.
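
The anchor-based subspace idea summarized above can be illustrated with a minimal sketch. It assumes that a policy is a flat parameter vector, that a subspace is the set of convex combinations of a few anchor vectors, and that candidate weights are drawn from a symmetric Dirichlet distribution. The function names, the `score_fn` callback (for example a negative imitation loss on offline data), and the sample count are hypothetical choices for illustration; the paper's actual pruning and extension criterion is not reproduced here.

```python
import random


def sample_simplex_weights(num_anchors, concentration=1.0, rng=random):
    """Draw one weight vector from a symmetric Dirichlet distribution.

    Normalizing independent Gamma(concentration, 1) draws yields a
    Dirichlet sample, so the weights are non-negative and sum to 1.
    """
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(num_anchors)]
    total = sum(draws)
    return [d / total for d in draws]


def combine_anchors(anchors, weights):
    """Return the convex combination sum_i weights[i] * anchors[i]."""
    if len(anchors) != len(weights):
        raise ValueError("need exactly one weight per anchor")
    dim = len(anchors[0])
    return [sum(w * a[k] for w, a in zip(weights, anchors)) for k in range(dim)]


def best_weights_in_subspace(anchors, score_fn, num_samples=64, seed=0):
    """Search the existing subspace by sampling weights on the simplex.

    score_fn is a hypothetical callback mapping a parameter vector to a
    score where higher is better.
    """
    rng = random.Random(seed)
    best_w, best_score = None, float("-inf")
    for _ in range(num_samples):
        w = sample_simplex_weights(len(anchors), rng=rng)
        score = score_fn(combine_anchors(anchors, w))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

In this sketch, storing only a weight vector per task (rather than a full copy of the network) is what keeps memory low when the existing anchors already cover a new task; adding an anchor would correspond to extending the subspace.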

Abstract

We consider a Continual Reinforcement Learning setup, where a learning agent must continuously adapt to new tasks while retaining previously acquired skill sets, with a focus on the challenge of avoiding forgetting past gathered knowledge and ensuring scalability with the growing number of tasks. Such issues prevail in autonomous robotics and video game simulations, notably for navigation tasks prone to topological or kinematic changes. To address these issues, we introduce HiSPO, a novel hierarchical framework designed specifically for continual learning in navigation settings from offline data. Our method leverages distinct policy subspaces of neural networks to enable flexible and efficient adaptation to new tasks while preserving existing knowledge. We demonstrate, through a careful experimental study, the effectiveness of our method in both classical MuJoCo maze environments and complex video game-like navigation simulations, showcasing competitive performances and satisfying adaptability with respect to classical continual learning metrics, in particular regarding the memory usage and efficiency.

Paper Structure

This paper contains 50 sections, 4 equations, 11 figures, 9 tables, 8 algorithms.

Figures (11)

  • Figure 1: Hierarchical Subspaces of Policies (HiSPO): (a) Pruning and Extension mechanisms. Pruning involves optimizing anchor weights $\alpha$ within a defined simplex, allowing efficient exploration of the existing subspace. Extending introduces new anchors to expand the subspace, facilitating adaptation to new tasks while keeping a compact representation of parameters. (b) The inference pipeline leveraging learned anchors. The high-level policy generates sub-goals, which the low-level policy follows by producing adequate actions. (c) Memory-efficient adaptation process. High-level and low-level policy subspaces expand as new tasks introduce unknown changes, either Topological (affecting path planning) or Kinematic (affecting local actions).
  • Figure 2: Performance vs. Relative Memory Size. The figure shows the average performance with respect to memory size of different CRL methods over streams from the defined environments. HiSPO (star) demonstrates high performance with moderate memory usage. Notably, as panel (d) shows for runs on random AntMaze tasks, our method is scalable and the resulting subspaces grow sublinearly.
  • Figure 3: Memory Usage on an AntMaze Stream.
  • Figure 4: All U (size = $5\times5$), M (size = $8\times8$), and L (size = $12\times9$) mazes provide a sparse reward with a value of $1$ when the agent is within a $0.5$ unit radius of the goal. The Point Agent is a point mass controlled by applying forces in two dimensions, allowing the agent to move freely across the plane towards a goal location. In contrast, the Ant Agent is a more complex articulated quadruped robot, controlled by applying torques to its joints.
  • Figure 5: The SimpleTown (S) and the AmazeVille (AH, AL) environments: the naming indicates whether specific doors are open (O) or closed (X), and whether movable green blocks are in high positions (H) or low positions (L), providing a clear way to distinguish between different maze configurations.
  • ...and 6 more figures
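
The sparse maze reward described in the Figure 4 caption can be sketched as follows. This is an illustration only: the function name, the 2D position tuples, the reward of 0 outside the radius, and treating the boundary as inside are assumptions, since the caption only states that the reward is $1$ within a $0.5$ unit radius of the goal.

```python
import math

GOAL_RADIUS = 0.5  # radius stated in the Figure 4 caption


def sparse_reward(agent_xy, goal_xy, radius=GOAL_RADIUS):
    """Return 1.0 when the agent is within `radius` of the goal, else 0.0.

    The zero reward elsewhere and the inclusive boundary are assumptions.
    """
    distance = math.dist(agent_xy, goal_xy)  # Euclidean distance in the plane
    return 1.0 if distance <= radius else 0.0
```

For example, `sparse_reward((0.1, 0.1), (0.0, 0.0))` gives `1.0`, while `sparse_reward((1.0, 1.0), (0.0, 0.0))` gives `0.0`.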