Deep Transfer $Q$-Learning for Offline Non-Stationary Reinforcement Learning

Jinhang Chai, Elynn Chen, Jianqing Fan

TL;DR

This paper pioneers the study of transfer learning for dynamic decision scenarios modeled by non-stationary finite-horizon Markov decision processes, utilizing neural networks as powerful function approximators and backward inductive learning.

Abstract

In dynamic decision-making scenarios across business and healthcare, leveraging sample trajectories from diverse populations can significantly enhance reinforcement learning (RL) performance for specific target populations, especially when sample sizes are limited. While existing transfer learning methods primarily focus on linear regression settings, they lack direct applicability to reinforcement learning algorithms. This paper pioneers the study of transfer learning for dynamic decision scenarios modeled by non-stationary finite-horizon Markov decision processes, utilizing neural networks as powerful function approximators and backward inductive learning. We demonstrate that naive sample pooling strategies, effective in regression settings, fail in Markov decision processes. To address this challenge, we introduce a novel "re-weighted targeting procedure" to construct "transferable RL samples" and propose "transfer deep $Q^*$-learning", enabling neural network approximation with theoretical guarantees. We assume that the reward functions are transferable and deal with both situations in which the transition densities are transferable or nontransferable. Our analytical techniques for transfer learning in neural network approximation and transition density transfers have broader implications, extending to supervised transfer learning with neural networks and domain shift scenarios. Empirical experiments on both synthetic and real datasets corroborate the advantages of our method, showcasing its potential for improving decision-making through strategically constructing transferable RL samples in non-stationary reinforcement learning contexts.
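
As background for the "backward inductive learning" mentioned in the abstract, the following is a minimal sketch of classical backward induction on a small tabular, finite-horizon, non-stationary MDP. It is not the paper's transfer deep $Q^*$-learning procedure: there is no neural network, no source data, and no re-weighting, and the transition and reward tables are assumed to be known rather than estimated from trajectories. The function name `backward_induction`, the array layout, and the toy numbers are all illustrative assumptions.

```python
import numpy as np


def backward_induction(P, r, gamma=1.0):
    """Exact backward induction for a tabular finite-horizon MDP.

    P: shape (T, S, A, S); P[t, s, a, s2] is the step-t transition probability.
    r: shape (T, S, A); r[t, s, a] is the expected step-t reward.
    Both may change with t, which is what makes the MDP non-stationary.
    Returns Q of shape (T, S, A) and the greedy policy of shape (T, S).
    """
    T, S, A = r.shape
    Q = np.zeros((T, S, A))
    V_next = np.zeros(S)  # no value is collected after the final step
    for t in reversed(range(T)):
        # Bellman optimality backup: r_t + gamma * E[max_a' Q_{t+1}(s', a')]
        Q[t] = r[t] + gamma * (P[t] @ V_next)
        V_next = Q[t].max(axis=1)
    return Q, Q.argmax(axis=2)


# Toy problem (hypothetical numbers): 2 steps, 2 states, 2 actions.
# Action a moves deterministically to state a, whatever the current state.
T, S, A = 2, 2, 2
P = np.zeros((T, S, A, S))
for a in range(A):
    P[:, :, a, a] = 1.0
r = np.zeros((T, S, A))
r[1] = [[1.0, 0.0], [0.0, 2.0]]  # rewards are paid only at the last step

Q, policy = backward_induction(P, r)
```

In this toy problem the final-step values are 1 in state 0 and 2 in state 1, so the first-step greedy action is always action 1 (move to state 1). In a sample-based, function-approximation setting such as the one the abstract describes, the exact expectation in the backup would be replaced by a regression fitted at each step, working backward from the horizon.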

Paper Structure

This paper contains 35 sections, 11 theorems, 136 equations, 4 figures, 1 table, 4 algorithms.

Key Result

Theorem 8

Consider the transfer RL setting with $K+1$ finite-horizon non-stationary MDPs: ${\cal M}^{(k)} = \left\{ {\cal S}, {\cal A}, P^{(k)}, r^{(k)}, \gamma, T \right\}$ for $k\in\{0\}\cup[K]$. Let $\widehat{Q}_t^{\rm tr}$ denote the estimator obtained by Algorithm alg:rwt-trans-q-general with DNN appr where $J:=\left\lvert{\cal A}\right\rvert$, $n_0$ and $n_{{\cal M}}$ are the number of trajectori

Figures (4)

  • Figure 1: Experimental workflow: Phase (a) collects initial target data using uniform random policies. Phase (b) applies RWT Transfer $Q$-learning using both target and source data. Phase (c) conducts on-policy evaluation, in the target environment, of the greedy policy derived from $\widehat{Q}^{\rm (tr)}$.
  • Figure 2: Cumulative regrets (left) and rewards (right) of the online evaluation phase with and without transfer, following the scheme illustrated in Figure 1. The offline source data set has 10,000 trajectories. The cumulative regrets and rewards are shown as a function of the "Target Sample Size", corresponding to the amount of target data collected in Phase (a). The online evaluation phase deploys the greedy policy both with and without transfer.
  • Figure 3: Cumulative rewards of the online evaluation phase with and without transfer in the MIMIC-3 calibrated environments, averaged over 1,000 trajectories and following the scheme illustrated in Figure 1. The values of the purple line are nearly identical across different target sample sizes, with differences appearing only in the third decimal place. The offline source data set has 10,000 trajectories. The x-axis, titled "Target Sample Size", represents the amount of target data sampled in the initial target data collection phase. The online evaluation deploys the greedy policy both with and without transfer.
  • Figure 4: Scree plot of the principal component analysis on 45 state variables.

Theorems & Definitions (33)

  • Remark 1
  • Remark 2
  • Definition 1
  • Definition 2: Hierarchical Composition Model
  • Theorem 8
  • Remark 3: Technique distinctions.
  • Remark 4: Advantage of transfer RL under total transition similarity
  • Remark 5: Extension to online transfer RL
  • Theorem 9
  • Remark 6
  • ...and 23 more