Doubly Inhomogeneous Reinforcement Learning

Liyuan Hu, Mengbing Li, Chengchun Shi, Zhenke Wu, Piotr Fryzlewicz

TL;DR

This work tackles reinforcement learning in doubly inhomogeneous environments, where dynamics evolve over time and vary across subjects. It introduces a data-rectangle identification framework that alternates between most recent change point detection and clustering, so that standard doubly homogeneous RL can be applied within each rectangle. The authors establish theoretical guarantees, including convergence of the change point and clustering errors and a regret bound with an oracle property, and validate the approach through simulations and the real-world Intern Health Study (IHS), showing meaningful gains in policy performance. The method is relevant to interventional mobile health and other settings with evolving, heterogeneous dynamics, where it supports more efficient and tailored sequential decision-making. Key ideas include Local Stationarity at the Endpoint (LSE), Local Homogeneity at the Endpoint (LHE), and the use of an information criterion to adaptively choose the number of clusters, all within a flexible, plug-in RL framework.

Abstract

This paper studies reinforcement learning (RL) in doubly inhomogeneous environments under temporal non-stationarity and subject heterogeneity. In a number of applications, it is commonplace to encounter datasets generated by system dynamics that may change over time and population, challenging high-quality sequential decision making. Nonetheless, most existing RL solutions require either temporal stationarity or subject homogeneity, which would result in sub-optimal policies if both assumptions were violated. To address both challenges simultaneously, we propose an original algorithm to determine the "best data chunks" that display similar dynamics over time and across individuals for policy learning, which alternates between most recent change point detection and cluster identification. Our method is general, and works with a wide range of clustering and change point detection algorithms. It is multiply robust in the sense that it takes multiple initial estimators as input and only requires one of them to be consistent. Moreover, by borrowing information over time and population, it allows us to detect weaker signals and has better convergence properties when compared to applying the clustering algorithm per time or the change point detection algorithm per subject. Empirically, we demonstrate the usefulness of our method through extensive simulations and a real data application.
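
The alternation between change point detection and clustering described above can be illustrated with a small toy sketch. This is not the authors' algorithm: it assumes scalar observations whose mean stands in for the transition dynamics, at most one change per cluster, a least-squares split in place of a formal change point test, and a one-dimensional k-means in place of a general clustering method. All function names (`make_toy_data`, `last_change_point`, `kmeans_1d`, `alternate`) are hypothetical.

```python
import numpy as np


def make_toy_data(seed=0, n_per_cluster=3, T=60, noise=0.2):
    """Toy panel: two groups of subjects whose mean shifts at different times."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(0.0, noise, size=(2 * n_per_cluster, T))
    Y[:n_per_cluster, 30:] += 3.0   # group A changes at t = 30
    Y[n_per_cluster:, 40:] -= 3.0   # group B changes at t = 40
    return Y


def last_change_point(Yk, min_seg=5):
    """Pool subjects in one cluster and pick the split minimising two-segment SSE.

    Assumption: a single change per cluster, so this split is also the most
    recent one. A real procedure would test whether a change exists at all.
    """
    m = Yk.mean(axis=0)             # borrow information across subjects
    T = m.size
    best_s, best_cost = min_seg, np.inf
    for s in range(min_seg, T - min_seg + 1):
        left, right = m[:s], m[s:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s


def kmeans_1d(x, k, n_iter=20):
    """Plain 1-D k-means with deterministic, evenly spaced initial centres."""
    centres = np.linspace(x.min(), x.max(), k)
    for _ in range(n_iter):
        labels = np.abs(x[:, None] - centres[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = x[labels == j].mean()
    return labels


def alternate(Y, k=2, n_rounds=5):
    """Alternate clustering (given change points) and change point detection (given clusters)."""
    N = Y.shape[0]
    tau = np.zeros(N, dtype=int)    # initial guess: no change, use all data
    labels = np.zeros(N, dtype=int)
    for _ in range(n_rounds):
        # Cluster subjects on their most recent segment only.
        feats = np.array([Y[i, tau[i]:].mean() for i in range(N)])
        labels = kmeans_1d(feats, k)
        # Re-estimate one change point per cluster from the pooled data.
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                tau[members] = last_change_point(Y[members])
    return labels, tau


Y = make_toy_data()
labels, tau = alternate(Y)
print(labels, tau)
```

In this sketch, the "data rectangle" for a cluster is the block `Y[members, tau:]`, that is, the observations of all subjects in the cluster after the estimated change point; a standard RL method would then be fitted on each such block.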

Paper Structure

This paper contains 18 sections, 3 theorems, 8 equations, 3 figures, 6 tables.

Key Result

Theorem 1

Suppose MA, LSE, LHE, and Assumptions as:tau--as:ergodic in the Supplementary Materials hold and that, as $T\to \infty$, the initial estimators satisfy $\max_i (\tau_i^0-\tau_i^*)_+/\tau_i^*\ll T^{-1/2}\sqrt{\log(NT)}$ and $\min_i \tau_i^0\ge \kappa T$ for some constant $\kappa>0$, wpa1. Then at each ite

Figures (3)

  • Figure 1: Basic building blocks with two subjects (one in each row) and a single change point. Different transition functions are represented by different colours.
  • Figure 2: Two additional doubly inhomogeneous environments. The top panel visualises an asynchronous evolution example where two subjects evolve at different time points asynchronously. The bottom panel visualises a split, merge and evolution example where an initial cluster first splits, then parts of it merge with another, evolving subsequently into a new one. In both panels, the best data rectangles with same dynamics over time and population are highlighted with bold borders. In particular in the bottom panel, subjects 1 and 3 evolve to a new shared dynamic at different time points and form a cluster. The best data rectangle of this cluster begins at the most recent change point $T - t_2^*$.
  • Figure 3: Left panel: The individual sleep duration trajectories for the detected two clusters in the IHS dataset, along with their cluster-specific change points. The red vertical lines visualise the change points. The green and yellow horizontal lines report the cluster-specific average sleep duration before and after the change points. Right panel: Examples of abrupt and smooth change points occurring at 50 and 40, respectively.

Theorems & Definitions (3)

  • Theorem 1
  • Corollary 1
  • Theorem 2