Enhancing Hierarchical Reinforcement Learning through Change Point Detection in Time Series

Hemanath Arumugam, Falong Fan, Bo Liu

TL;DR

This work introduces a Transformer-based Change Point Detection (CPD) module integrated into the Option-Critic hierarchical reinforcement learning framework to automatically segment state trajectories and discover meaningful options. CPD provides supervision for termination, enables CPD-guided pretraining of intra-option policies via behavioral cloning, and enforces inter-option diversity across CPD-defined state partitions, all within a joint structure-aware optimization objective. Empirical results in discrete Four-Rooms and continuous Pinball domains show faster convergence, higher cumulative returns, and stronger option specialization, particularly after environmental regime changes. The approach demonstrates that incorporating learned temporal structure as a prior yields more interpretable, sample-efficient, and robust hierarchical policies for dynamic tasks.

Abstract

Hierarchical Reinforcement Learning (HRL) enhances the scalability of decision-making in long-horizon tasks by introducing temporal abstraction through options, which are policies that span multiple timesteps. Despite its theoretical appeal, the practical implementation of HRL suffers from the challenge of autonomously discovering semantically meaningful subgoals and learning optimal option termination boundaries. This paper introduces a novel architecture that integrates a self-supervised, Transformer-based Change Point Detection (CPD) module into the Option-Critic framework, enabling adaptive segmentation of state trajectories and the discovery of options. The CPD module is trained using heuristic pseudo-labels derived from intrinsic signals to infer latent shifts in environment dynamics without external supervision. These inferred change-points are leveraged in three critical ways: (i) to serve as supervisory signals for stabilizing termination function gradients, (ii) to pretrain intra-option policies via segment-wise behavioral cloning, and (iii) to enforce functional specialization through inter-option divergence penalties over CPD-defined state partitions. The overall optimization objective enhances the standard actor-critic loss using structure-aware auxiliary losses. In our framework, option discovery arises naturally as CPD-defined trajectory segments are mapped to distinct intra-option policies, enabling the agent to autonomously partition its behavior into reusable, semantically meaningful skills. Experiments on the Four-Rooms and Pinball tasks demonstrate that CPD-guided agents exhibit accelerated convergence, higher cumulative returns, and significantly improved option specialization. These findings confirm that integrating structural priors via change-point segmentation leads to more interpretable, sample-efficient, and robust hierarchical policies in complex environments.
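
The three uses of the inferred change-points and the way they enter the overall objective can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the exact loss forms, so the sketch assumes binary cross-entropy for termination supervision, negative log-likelihood for segment-wise behavioral cloning, and a negative mean pairwise KL divergence as the diversity penalty. All function names and the weights `w_term`, `w_bc`, and `w_div` are hypothetical.

```python
import math

EPS = 1e-8  # numerical floor so that log() stays finite


def termination_bce(beta_probs, cpd_labels):
    """Assumed form of (i): binary cross-entropy between termination
    probabilities and CPD change-point labels (1 = change point)."""
    total = 0.0
    for beta, label in zip(beta_probs, cpd_labels):
        beta = min(max(beta, EPS), 1.0 - EPS)
        total -= label * math.log(beta) + (1 - label) * math.log(1.0 - beta)
    return total / len(beta_probs)


def behavioral_cloning_nll(action_probs, actions):
    """Assumed form of (ii): mean negative log-likelihood of the actions
    observed in one CPD segment under the option assigned to that segment."""
    nll = [-math.log(max(probs[a], EPS)) for probs, a in zip(action_probs, actions)]
    return sum(nll) / len(nll)


def kl_divergence(p, q):
    """KL(p || q) for two discrete action distributions."""
    return sum(pi * math.log(pi / max(qi, EPS)) for pi, qi in zip(p, q) if pi > 0)


def diversity_penalty(option_action_probs):
    """Assumed form of (iii): negative mean pairwise KL between the action
    distributions of the options on one state partition (needs >= 2 options).
    Minimizing this value pushes the options apart."""
    n = len(option_action_probs)
    pair_kls = [
        kl_divergence(option_action_probs[i], option_action_probs[j])
        for i in range(n) for j in range(n) if i != j
    ]
    return -sum(pair_kls) / len(pair_kls)


def joint_loss(actor_critic_loss, l_term, l_bc, l_div,
               w_term=1.0, w_bc=1.0, w_div=0.1):
    """Actor-critic loss plus weighted auxiliary terms (weights are hypothetical)."""
    return actor_critic_loss + w_term * l_term + w_bc * l_bc + w_div * l_div
```

In this sketch, two options with identical action distributions give a diversity penalty of zero, while options that prefer different actions give a negative value, so the joint loss decreases as the options specialize.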

Paper Structure

This paper contains 28 sections, 16 equations, 5 figures, 3 algorithms.

Figures (5)

  • Figure 1: Workflow of the CPD-integrated Option-Critic framework.
  • Figure 2: Comparison of the agents' learning curves.
  • Figure 3: Comparison of average steps to reach the goal, including optimal steps.
  • Figure 4: Performance comparison of the CPD-enabled Option-Critic agent with the Option-Critic agent using 8 options.
  • Figure 5: Learning curve comparison of the Option-Critic agent and CPD-enabled Option-Critic agents.