Deep Transfer $Q$-Learning for Offline Non-Stationary Reinforcement Learning

Jinhang Chai; Elynn Chen; Jianqing Fan

TL;DR

This paper pioneers the study of transfer learning for dynamic decision scenarios modeled by non-stationary finite-horizon Markov decision processes, utilizing neural networks as powerful function approximators and backward inductive learning.

Abstract

In dynamic decision-making scenarios across business and healthcare, leveraging sample trajectories from diverse populations can significantly enhance reinforcement learning (RL) performance for specific target populations, especially when sample sizes are limited. While existing transfer learning methods primarily focus on linear regression settings, they lack direct applicability to reinforcement learning algorithms. This paper pioneers the study of transfer learning for dynamic decision scenarios modeled by non-stationary finite-horizon Markov decision processes, utilizing neural networks as powerful function approximators and backward inductive learning. We demonstrate that naive sample pooling strategies, effective in regression settings, fail in Markov decision processes. To address this challenge, we introduce a novel ``re-weighted targeting procedure'' to construct ``transferable RL samples'' and propose ``transfer deep $Q^*$-learning'', enabling neural network approximation with theoretical guarantees. We assume that the reward functions are transferable and treat both the case in which the transition densities are transferable and the case in which they are not. Our analytical techniques for transfer learning in neural network approximation and transition density transfers have broader implications, extending to supervised transfer learning with neural networks and domain shift scenarios. Empirical experiments on both synthetic and real datasets corroborate the advantages of our method, showcasing its potential for improving decision-making through strategically constructing transferable RL samples in non-stationary reinforcement learning contexts.
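
The backward inductive learning mentioned in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's transfer deep $Q^*$-learning: it uses target data only, applies no re-weighting, and replaces the neural network with per-stage sample averages. The function name `backward_q_learning`, the `(t, s, a, r, s_next)` tuple layout, and the toy data are assumptions made for illustration.

```python
import numpy as np

def backward_q_learning(transitions, n_states, n_actions, horizon, gamma=1.0):
    """Tabular backward-inductive Q-learning from offline transitions.

    transitions: iterable of (t, s, a, r, s_next) with 0 <= t < horizon
    (a hypothetical layout chosen for this sketch).
    Returns Q with shape (horizon + 1, n_states, n_actions); Q[horizon] is zero.
    """
    Q = np.zeros((horizon + 1, n_states, n_actions))
    # Group samples by stage, because the MDP is non-stationary in t.
    by_stage = [[] for _ in range(horizon)]
    for t, s, a, r, s_next in transitions:
        by_stage[t].append((s, a, r, s_next))
    # Sweep backward from the last stage to the first.
    for t in range(horizon - 1, -1, -1):
        sums = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        for s, a, r, s_next in by_stage[t]:
            # Regression target: reward plus discounted best next-stage value.
            sums[s, a] += r + gamma * Q[t + 1, s_next].max()
            counts[s, a] += 1
        seen = counts > 0
        # Average the targets for observed (s, a) pairs; unseen pairs stay 0.
        Q[t][seen] = sums[seen] / counts[seen]
    return Q

# Toy data: 2 states, 2 actions, horizon 2 (illustrative numbers only).
toy = [
    (1, 0, 0, 1.0, 0), (1, 0, 1, 2.0, 1),
    (1, 1, 0, 0.0, 0), (1, 1, 1, 0.5, 1),
    (0, 0, 0, 0.0, 0), (0, 0, 1, 1.0, 1),
]
Q_hat = backward_q_learning(toy, n_states=2, n_actions=2, horizon=2)
greedy_action = int(Q_hat[0, 0].argmax())  # greedy action at t=0 in state 0
```

In this toy example the stage-0 estimate depends on the stage-1 estimate through the `max` term, so an error in a later stage propagates to every earlier one.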

Paper Structure (35 sections, 11 theorems, 136 equations, 4 figures, 1 table, 4 algorithms)

Introduction
Related Works and Distinctions of this Work
Organization
Transfer RL and Transferable Samples
Transfer Reinforcement Learning
Similarity Characterizations
Challenges in Transfer $Q$-Learning
Re-Weighted Targeting for Transferable Samples
Aggregated Reward and $Q^*$-Functions
Transfer Backward-Inductive $Q$-Learning
Transfer $Q$-Learning with DNN Approximation
Deep Neural Networks for $Q^*$-Function Approximation
Transition Ratio Estimation without Transition Transfer
Transition Density Ratio Estimation with Transfer
Theoretical Results with DNN Approximation
...and 20 more sections

Key Result

Theorem 8

Consider the transfer RL setting with $K+1$ finite-horizon non-stationary MDPs: ${\cal M}^{(k)} = \left\{ {\cal S}, {\cal A}, P^{(k)}, r^{(k)}, \gamma, T \right\}$ for $k\in\{0\}\cup[K]$. Let $\widehat{Q}_t^{\rm tr}$ denote the estimator obtained by Algorithm alg:rwt-trans-q-general with DNN appr where $J:=\left\lvert{\cal A}\right\rvert$, $n_0$ and $n_{{\cal M}}$ are the number of trajectori

Figures (4)

Figure 1: Experimental workflow: Phase (a) collects initial target data using uniform random policies. Phase (b) applies RWT Transfer $Q$-learning using both target and source data. Phase (c) conducts on-policy evaluation of the derived greedy policy from $\widehat{Q}^{\rm (tr)}$ in the target environment.
Figure 2: Cumulative regrets (left) and rewards (right) of the online evaluation phase with or without transfer, following the scheme illustrated in Figure 1. The offline source data set has $10,000$ trajectories. The cumulative regrets and rewards are shown as a function of the "Target Sample Size", the amount of target data collected in Phase (a). The online evaluation phase deploys the greedy policy both with and without transfer.
Figure 3: Cumulative rewards of the online evaluation phase with or without transfer in the MIMIC-3 calibrated environments, averaged over $1,000$ trajectories and following the scheme illustrated in Figure 1. The values of the purple line are nearly identical across target sample sizes, with differences appearing only in the third decimal place. The offline source data set has $10,000$ trajectories. The x-axis, titled "Target Sample Size", gives the amount of target data sampled in the initial target data collection phase. The online evaluation deploys the greedy policy both with and without transfer.
Figure 4: Scree plot of the principal component analysis on 45 state variables.

Theorems & Definitions (33)

Remark 1
Remark 2
Definition 1
Definition 2: Hierarchical Composition Model
Theorem 8
Remark 3: Technique distinctions.
Remark 4: Advantage of transfer RL under total transition similarity
Remark 5: Extension to online transfer RL
Theorem 9
Remark 6
...and 23 more
