Data-Driven Knowledge Transfer in Batch $Q^*$ Learning

Elynn Chen; Xi Chen; Wenbo Jing

TL;DR

This work tackles data scarcity in offline, high-dimensional RL by proposing a Transfer FQI framework that directly estimates the optimal action-value function $Q^*$ across related MDPs. The method simultaneously learns a shared center component from source tasks and corrects task-specific bias, employing both general function approximation and a sieve-based, semi-parametric approach. The authors derive regret bounds for both transition-homogeneous and transition-heterogeneous settings, provide a data-driven procedure to adapt the function-basis size, and establish conditions under which knowledge transfer outperforms single-task learning. Empirical studies on simulations and a MIMIC-III clinical dataset demonstrate substantial gains from transfer when source tasks are informative and discrepancies are moderate, highlighting practical impact for data-scarce domains.

Abstract

In data-driven decision-making in marketing, healthcare, and education, it is desirable to use the large amount of data available from existing ventures to navigate high-dimensional feature spaces and address data scarcity in new ventures. We explore knowledge transfer in dynamic decision-making by concentrating on batch stationary environments and formally defining task discrepancies through the lens of Markov decision processes (MDPs). We propose a Transferred Fitted $Q$-Iteration framework with general function approximation, which enables direct estimation of the optimal action-value function $Q^*$ using both target and source data. Under sieve approximation, we establish the relationship between statistical performance and MDP task discrepancy, which clarifies how the source sample size, the target sample size, and the task discrepancy affect the effectiveness of knowledge transfer. We show, both theoretically and empirically, that the final learning error of the $Q^*$ function improves significantly on the single-task rate.
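To make the two-step idea in the abstract concrete, the following is a minimal sketch of one way a transfer-style fitted $Q$-iteration could look. It is not the authors' algorithm. Everything in it is an assumption chosen for illustration: the function names (`basis`, `ridge`, `max_q`, `transfer_fqi`), the scalar state, the polynomial basis $[1, x, x^2]$ per action standing in for a sieve basis, and the ridge penalty used for both the pooled fit and the task-specific correction.

```python
import numpy as np

def basis(x, a, n_actions):
    """Illustrative basis (assumption): [1, x, x^2] placed in the block of action a."""
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=int)
    poly = np.stack([np.ones_like(x), x, x**2], axis=1)  # shape (n, 3)
    phi = np.zeros((x.shape[0], 3 * n_actions))
    for j in range(n_actions):
        mask = a == j
        phi[mask, 3 * j:3 * (j + 1)] = poly[mask]
    return phi

def ridge(phi, y, lam):
    """Ridge regression coefficients for y ~ phi (penalty choice is an assumption)."""
    d = phi.shape[1]
    return np.linalg.solve(phi.T @ phi + lam * np.eye(d), phi.T @ y)

def max_q(beta, x_next, n_actions):
    """max over actions of the fitted Q at the next states."""
    qs = [basis(x_next, np.full(len(x_next), j), n_actions) @ beta
          for j in range(n_actions)]
    return np.max(np.stack(qs, axis=1), axis=1)

def transfer_fqi(target, sources, n_actions, gamma=0.9, n_iters=50,
                 lam_pool=1e-3, lam_bias=1.0):
    """Each task is a tuple (x, a, r, x_next) of 1-D arrays; index 0 is the target."""
    tasks = [target] + list(sources)
    betas = [np.zeros(3 * n_actions) for _ in tasks]
    for _ in range(n_iters):
        phis, ys = [], []
        for (x, a, r, x_next), beta in zip(tasks, betas):
            phis.append(basis(x, a, n_actions))
            # Bellman pseudo-response built from the task's own current estimate
            ys.append(r + gamma * max_q(beta, x_next, n_actions))
        # Step 1: pooled fit over all tasks gives a shared "center" component
        center = ridge(np.vstack(phis), np.concatenate(ys), lam_pool)
        # Step 2: each task corrects the center using its own residuals
        betas = [center + ridge(phi, y - phi @ center, lam_bias)
                 for phi, y in zip(phis, ys)]
    return betas[0]  # coefficients of the target task's Q estimate
```

Calling `transfer_fqi(target, [source_1, source_2], n_actions=2)` returns the coefficient vector of the target task's estimated $Q$ function; a greedy policy then picks, at each state, the action whose block gives the largest fitted value. The heavier penalty on the correction step reflects the intuition that the task-specific bias should stay small when source and target tasks are similar.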

Paper Structure (27 sections, 7 theorems, 139 equations, 3 figures, 2 algorithms)

Introduction
Literature and Organization
Statistical Framework
Mathematical Framework for RL
The Target and Source RL Data.
Similarity Measure for Transferring between Different MDPs.
Batch $Q^*$ Learning with Knowledge Transfer
Transfer FQI with General Function Approximation
Transfer FQI with Sieve Function Approximation
Theory
Theoretical Results for Transition Homogeneous Tasks
Select the Number of Basis Functions
Theoretical Results for Transition Heterogeneous Tasks
Empirical Studies
Simulations
...and 12 more sections

Key Result

Lemma 2.1

Let the difference between the optimal action-value functions across different tasks be defined as […]. Assume that the reward functions $r^{(k)}(\boldsymbol{x}, a)$ are uniformly upper bounded by a constant $R_{\max}$. Then we have […].

Figures (3)

Figure 1: Boxplots of the estimation errors for the $Q^*$ function with $I^{(0)}=20$ and different source sample sizes $I^{(1)}$. The parameter $\sigma_C$ on the top of each subfigure indicates the standard deviation of the difference matrix $\boldsymbol{C}_{\delta}$.
Figure 2: The estimated $Q^*$ values obtained by the "one-step" and "two-step" methods for $Q^*((x, 0, 0), -1)$ (left) and $Q^*((x, 0, 0), 1)$ (right). The legend "truth" represents the true value of the $Q^*$ function approximated by Monte Carlo simulation.
Figure 3: The regrets $v^{\pi^*}-v^{\widehat{\pi}}$ of the policies obtained by "one-step" and "two-step" algorithms with $I^{(0)}=100$ (left) and $I^{(0)}=500$ (right). The black dashed line shows the regret of FQI on the target task without knowledge transfer.

Theorems & Definitions (19)

Lemma 2.1: Difference of $Q^*$
Remark 1
Definition 4.2: Hölder $\kappa$-smooth functions
Lemma 4.4
Remark 2
Theorem 4.5
Remark 3
Remark 4
Remark 5
Theorem 4.6
...and 9 more
