Hierarchical Subspaces of Policies for Continual Offline Reinforcement Learning
Anthony Kobanda, Rémy Portelas, Odalric-Ambrym Maillard, Ludovic Denoyer
TL;DR
This work tackles offline continual reinforcement learning (CRL) for navigation by introducing HiSPO, a hierarchical framework that maintains two separate policy subspaces: one for high-level planning and one for low-level control. The subspaces are anchor-based, support pruning and extension, and are learned offline from expert data; HiSPO explores them with Dirichlet sampling and uses a PAC-style criterion for zero-shot transfer. In experiments on MuJoCo maze tasks and Godot-based video-game-like environments, HiSPO achieves strong performance with significantly lower memory usage than full-expansion baselines, while mitigating forgetting and preserving generalization across task streams. Combining hierarchical imitation learning with subspace-based adaptation yields a scalable, memory-efficient approach to offline CRL in dynamic navigation settings. Ablations demonstrate the value of the two-subspace architecture and point to potential extensions via LoRA and PAC-based decisions.
Abstract
We consider a Continual Reinforcement Learning setup, where a learning agent must continuously adapt to new tasks while retaining previously acquired skill sets. We focus on two challenges: avoiding the forgetting of previously acquired knowledge and ensuring scalability as the number of tasks grows. Such issues are prevalent in autonomous robotics and video game simulations, notably for navigation tasks prone to topological or kinematic changes. To address them, we introduce HiSPO, a novel hierarchical framework designed specifically for continual learning in navigation settings from offline data. Our method leverages distinct policy subspaces of neural networks to enable flexible and efficient adaptation to new tasks while preserving existing knowledge. Through a careful experimental study, we demonstrate the effectiveness of our method in both classical MuJoCo maze environments and complex video game-like navigation simulations, showing competitive performance and satisfactory adaptability with respect to classical continual learning metrics, in particular memory usage and efficiency.
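To make the notion of an anchor-based policy subspace with Dirichlet-based exploration more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a subspace is the convex hull of a few anchor parameter vectors and that a candidate policy is obtained by sampling convex weights from a Dirichlet distribution. The function names (`sample_subspace_policy`, `extend_subspace`), the flat parameter vectors, and the concentration parameter `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_subspace_policy(anchors, alpha=1.0, rng=None):
    """Sample one set of policy parameters from the convex hull of the anchors.

    anchors: array of shape (k, d), one flattened parameter vector per anchor
             (assumed layout, for illustration only).
    alpha:   symmetric Dirichlet concentration (hypothetical default).
    Returns the sampled convex weights (k,) and the combined parameters (d,).
    """
    rng = np.random.default_rng() if rng is None else rng
    anchors = np.asarray(anchors, dtype=float)
    k = anchors.shape[0]
    # Dirichlet samples are non-negative and sum to 1, so they are
    # valid convex-combination weights over the k anchors.
    weights = rng.dirichlet(alpha * np.ones(k))
    params = weights @ anchors
    return weights, params

def extend_subspace(anchors, new_anchor):
    """Return a new anchor matrix with one extra anchor appended (extension step)."""
    return np.vstack([np.asarray(anchors, dtype=float), new_anchor])

# Usage: a toy subspace with three anchors in a 2-dimensional parameter space.
rng = np.random.default_rng(0)
anchors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
weights, params = sample_subspace_policy(anchors, rng=rng)
extended = extend_subspace(anchors, np.array([1.0, 1.0]))
```

Under these assumptions, memory grows by one anchor per extension rather than by one full network per task, which is the intuition behind the comparison with full-expansion baselines. A pruning step would discard the newly added anchor when it brings no measurable benefit; the criterion for that decision is not specified here.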
