Leveraging Unlabeled Data Sharing through Kernel Function Approximation in Offline Reinforcement Learning

Yen-Ru Lai, Fu-Chieh Chang, Pei-Yuan Wu

TL;DR

This work extends reward-free data sharing to kernelized offline reinforcement learning by embedding reward estimation and pessimism within an RKHS framework. By combining kernel ridge regression for reward estimation, a pessimistic reward function, and PEVI with data splitting, it leverages unlabeled data to improve policy learning in finite-horizon MDPs while preserving theoretical guarantees. The analysis highlights how kernel eigenvalue decay governs information gain and suboptimality, with explicit bounds under d-finite, exponential, and polynomial decay regimes, and a weak data-coverage condition. Empirically, kernel methods outperform finite-dimensional feature maps when unlabeled data is scarce, and the asymptotic behavior of the value function aligns with the predicted rates. Overall, the approach offers a principled, theoretically grounded path to exploit unlabeled data in offline RL with broad function-approximation flexibility and practical impact for data-limited settings.
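The reward-estimation and pessimism steps named above can be illustrated with a minimal sketch. It assumes a Gaussian (RBF) kernel, a ridge parameter `nu`, and a penalty of the common form $\beta\,\nu^{-1/2}\sqrt{k(z,z)-k_z^\top (K+\nu I)^{-1}k_z}$; the function names, the kernel choice, and the clipping at zero are illustrative assumptions, not necessarily the paper's exact construction.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))


def fit_reward_krr(Z_labeled, r_labeled, nu=1.0, lengthscale=1.0):
    """Kernel ridge regression of rewards on labeled state-action features."""
    K = rbf_kernel(Z_labeled, Z_labeled, lengthscale)
    A = K + nu * np.eye(len(Z_labeled))      # regularized Gram matrix
    alpha = np.linalg.solve(A, r_labeled)    # (K + nu I)^{-1} r
    return {"Z": Z_labeled, "A": A, "alpha": alpha,
            "nu": nu, "lengthscale": lengthscale}


def pessimistic_reward(model, Z_query, beta=1.0):
    """Return (pessimistic reward, ridge estimate, uncertainty width)."""
    k_q = rbf_kernel(Z_query, model["Z"], model["lengthscale"])  # shape (m, n)
    mean = k_q @ model["alpha"]
    solved = np.linalg.solve(model["A"], k_q.T)                  # shape (n, m)
    # k(z, z) = 1 for the RBF kernel assumed here.
    var = 1.0 - np.sum(k_q * solved.T, axis=1)
    width = np.sqrt(np.maximum(var, 0.0) / model["nu"])
    # Assumed form of pessimism: subtract the scaled width, clip at zero.
    return np.maximum(mean - beta * width, 0.0), mean, width


# Synthetic illustration: a few labeled points, many unlabeled ones.
rng = np.random.default_rng(0)
Z_labeled = rng.uniform(-1.0, 1.0, size=(20, 2))
r_labeled = 0.5 + 0.5 * np.sin(Z_labeled[:, 0])
Z_unlabeled = rng.uniform(-1.0, 1.0, size=(200, 2))

model = fit_reward_krr(Z_labeled, r_labeled, nu=1.0)
r_tilde, r_hat, width = pessimistic_reward(model, Z_unlabeled, beta=0.5)
```

In this sketch the unlabeled transitions would then be relabeled with `r_tilde` and merged with the labeled data before running a pessimistic value-iteration routine such as PEVI; that planning step is not shown here.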

Abstract

Offline reinforcement learning (RL) learns policies from a fixed dataset, but often requires large amounts of data. Labeled datasets are expensive to build, especially when rewards must be provided by human labelers, whereas unlabeled data tends to be cheaper. This makes it important to find effective ways to use unlabeled data in offline RL when labeled data is limited or expensive to obtain. In this paper, we present an algorithm that utilizes unlabeled data in offline RL with kernel function approximation, and we give a theoretical guarantee for it. We present various eigenvalue decay conditions of $\mathcal{H}_k$ which determine the complexity of the algorithm. In summary, our work provides a promising approach for exploiting the advantages offered by unlabeled data in offline RL while maintaining theoretical assurances.

Paper Structure

This paper contains 23 sections, 13 theorems, 121 equations, 2 figures, 1 table, 3 algorithms.

Key Result

Proposition 4.1

We define $\beta_h(\delta)$ with the labeled dataset $\mathcal{D}_1$ by $\beta_h(\delta)=\sqrt{\nu}\,\mathscr{S}+\sqrt{\log\frac{\operatorname{det}\left[\nu I+K_h^{\mathcal{D}_1}\right]}{\delta^2}}$, where $K_h^{\mathcal{D}_1}$ is the Gram matrix constructed from the dataset $\mathcal{D}_1$ as … Moreover, define $\mathcal{C}_h(\delta)=\{\theta \in \mathcal{H}_k : \|\theta-\widehat{\theta}\,\ldots$
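The radius $\beta_h(\delta)$ above can be evaluated numerically from a Gram matrix. The sketch below is a minimal illustration that uses a log-determinant for stability and rewrites $\log\frac{\det[\cdot]}{\delta^2}$ as $\log\det[\cdot]-2\log\delta$. The function name `beta_h` and the argument names are hypothetical, and $\mathscr{S}$ is treated as a user-supplied constant `S`, since its definition is not shown in this excerpt.

```python
import numpy as np


def beta_h(K, delta, nu=1.0, S=1.0):
    """Evaluate sqrt(nu) * S + sqrt(log(det[nu I + K] / delta^2)).

    K     : (n, n) symmetric positive semidefinite Gram matrix
    delta : value in (0, 1)
    nu    : regularization parameter (assumed positive)
    S     : constant standing in for script-S (assumption: user-supplied)
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    n = K.shape[0]
    sign, logdet = np.linalg.slogdet(nu * np.eye(n) + K)
    if sign <= 0:
        raise ValueError("nu * I + K must be positive definite")
    inside = logdet - 2.0 * np.log(delta)   # log(det / delta^2)
    if inside < 0.0:
        raise ValueError("log(det / delta^2) is negative; check nu and delta")
    return np.sqrt(nu) * S + np.sqrt(inside)


# Example: identity Gram matrix of size 2, so det[I + K] = 4.
K_example = np.eye(2)
radius = beta_h(K_example, delta=0.1, nu=1.0, S=1.0)
```

Working with `slogdet` rather than `det` avoids overflow when the labeled dataset is large, because the determinant of $\nu I + K_h^{\mathcal{D}_1}$ grows quickly with the number of samples.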

Figures (2)

  • Figure 1: Comparison of experimental values and asymptotic approximation of $V^{\pi}_1(s)$.
  • Figure 2: Comparison of the values of $V^{\pi}_1(s)$ between finite-dimensional features and kernel features.

Theorems & Definitions (28)

  • Proposition 4.1
  • proof
  • Lemma 4.2
  • Theorem 4.3
  • proof
  • Remark 4.4
  • Proposition 4.5
  • proof
  • Remark 4.6
  • Corollary 4.8: Well-Explored Dataset
  • ...and 18 more