Semi-pessimistic Reinforcement Learning

Jin Zhu, Xin Zhou, Jiaang Yao, Gholamali Aminian, Omar Rivasplata, Simon Little, Lexin Li, Chengchun Shi

TL;DR

This work tackles offline RL under distributional shift when labeled rewards are scarce by introducing semi-pessimistic RL (SPL), which leverages abundant unlabeled data and learns from a pessimistic lower bound on the reward function rather than on the Q-function or transition function. The method combines semi-supervised uncertainty quantification with imputation of the missing rewards, yielding two algorithms (model-free and model-based) that operate under a new semi-coverage condition. Theoretical results provide finite-sample regret bounds that tighten as the amount of unlabeled data grows and as reward uncertainty decreases; the method is validated empirically on synthetic tasks, MuJoCo benchmarks, and a semi-synthetic adaptive deep brain stimulation (DBS) application. The approach offers a practical, flexible framework for high-stakes domains where rewards are scarce but unlabeled transitions are plentiful, improving policy learning without the more restrictive conditions required by fully pessimistic methods.

Abstract

Offline reinforcement learning (RL) aims to learn an optimal policy from pre-collected data. However, it faces the challenge of distributional shift, where the learned policy may encounter scenarios not covered in the offline data. Additionally, numerous applications suffer from a scarcity of labeled reward data. Relying on labeled data alone often leads to a narrow state-action distribution, which further amplifies the distributional shift and results in suboptimal policy learning. To address these issues, we first recognize that the volume of unlabeled data is typically substantially larger than that of labeled data. We then propose a semi-pessimistic RL method to effectively leverage abundant unlabeled data. Our approach offers several advantages. It considerably simplifies the learning process, as it seeks a lower bound of the reward function, rather than that of the Q-function or state transition function. It is highly flexible, and can be integrated with a range of model-free and model-based RL algorithms. It enjoys guaranteed improvement when utilizing vast unlabeled data, while requiring much less restrictive conditions. We compare our method with a number of alternative solutions, both analytically and numerically, and demonstrate its clear competitiveness. We further illustrate it with an application to adaptive deep brain stimulation for Parkinson's disease.
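To make the idea of "a lower bound of the reward function" concrete, here is a minimal sketch under strong simplifying assumptions: a tabular MDP, a count-based penalty of the form `beta / sqrt(n)`, and a fallback value for state-action pairs with no labeled reward. None of these choices is taken from the paper; the function names, the `beta` parameter, and the penalty form are hypothetical and stand in for the paper's semi-supervised uncertainty quantification.

```python
import numpy as np

def pessimistic_reward(labeled, n_states, n_actions, beta=1.0):
    """Count-based lower bound on the mean reward of each (s, a).

    labeled: list of (s, a, r) tuples with observed rewards.
    beta: hypothetical penalty scale (an assumption, not from the paper).
    """
    labeled = list(labeled)
    counts = np.zeros((n_states, n_actions))
    sums = np.zeros((n_states, n_actions))
    for s, a, r in labeled:
        counts[s, a] += 1
        sums[s, a] += r
    # Empirical mean reward where labels exist, zero elsewhere for now.
    mean = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # Penalty shrinks as more labeled rewards are observed for (s, a).
    lower = mean - beta / np.sqrt(np.maximum(counts, 1.0))
    # Pairs without any label get the smallest observed reward minus beta.
    r_min = min(r for _, _, r in labeled)
    lower[counts == 0] = r_min - beta
    return lower

def impute_unlabeled(unlabeled, lower):
    """Attach the pessimistic reward to each unlabeled (s, a, s_next) transition."""
    return [(s, a, float(lower[s, a]), s_next) for s, a, s_next in unlabeled]
```

The imputed transitions can then be pooled with the labeled ones and passed to any standard offline RL routine, which is one way to read the abstract's claim that the approach integrates with both model-free and model-based algorithms.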

Paper Structure

This paper contains 29 sections, 7 theorems, 67 equations, 5 figures, 1 table, 3 algorithms.

Key Result

Lemma 1

For any $(s,a)\in\mathcal{S}\times\mathcal{A}$ with $d_{\mathcal{L}}(s,a)>0$, we have $d_{\mathcal{L}\cup \mathcal{U}}(s,a)>0$.
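The empirical analogue of Lemma 1 can be checked directly: every state-action pair that appears in the labeled data also appears in the pooled data, and pooling can only enlarge the covered set. The sketch below uses made-up toy transitions and a hypothetical helper `support`; it treats "$d_{\mathcal{D}}(s,a)>0$" as "the pair occurs at least once in $\mathcal{D}$", which is an assumption for illustration.

```python
def support(transitions):
    """Set of (s, a) pairs that occur at least once in the dataset."""
    return {(t[0], t[1]) for t in transitions}

# Toy data (hypothetical): labeled tuples carry a reward, unlabeled ones do not.
labeled = [(0, 0, 1.0, 1), (1, 2, 0.0, 2)]      # (s, a, r, s_next)
unlabeled = [(0, 1, 1), (2, 0, 3), (1, 2, 2)]   # (s, a, s_next)

sup_L = support(labeled)
sup_LU = support(labeled) | support(unlabeled)

# Coverage of L is contained in coverage of the pooled data.
assert sup_L <= sup_LU

# Pairs covered only thanks to the unlabeled data.
newly_covered = sup_LU - sup_L
```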

Figures (5)

  • Figure 1: (a): Graphical illustration of the environment. (b): The average return with $n_\mathcal{L} = 120$; left panel: the ratio $n_\mathcal{U} / n_\mathcal{L}$ (horizontal axis) varies; right panel: the value of $\epsilon$ (horizontal axis) varies with $n_\mathcal{U} = 150$.
  • Figure 2: Visualization of the state-action distribution of $\mathcal{L}$, $\mathcal{U}$ and $\mathcal{L} \cup \mathcal{U}$ on an MDP with five states (horizontal axis) and three actions (vertical axis). Each state-action pair $(s, a)$ is colored green if $d_{\mathcal{D}}(s, a) > 0$, indicating its presence in the data, and orange if absent.
  • Figure 3: Average regret of the policy learned by numerous model-free RL algorithms in the synthetic environment. (a): the size of the labeled data $n_\mathcal{L}$ (horizontal axis) varies with $n_\mathcal{U} / n_\mathcal{L}$ fixed at 10; (b): the ratio $n_\mathcal{U} / n_\mathcal{L}$ (horizontal axis) varies with $n_\mathcal{L}$ fixed at 32.
  • Figure 4: Average cumulative reward of the policy learned by numerous model-based RL algorithms in the MuJoCo environments. The error bar shows the standard deviation. Rows represent different state-action distributions of $\mathcal{U}$, and columns represent three different MuJoCo environments.
  • Figure 5: The cumulative reward of SPL and PPL under different values of $\epsilon$, where the unlabeled data is generated using the $\epsilon$-greedy algorithm. A small $\epsilon$ indicates that the behavior policy that generates the unlabeled data is close to the optimal policy, whereas a large $\epsilon$ indicates that the behavior policy is close to random exploration.
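For readers unfamiliar with the $\epsilon$-greedy behavior policy mentioned in the Figure 5 caption, a minimal sketch follows. It assumes a tabular setting in which a row of action values `q_row` for the current state is available; the function name and arguments are hypothetical and are not taken from the paper's experimental code.

```python
import numpy as np

def epsilon_greedy_action(q_row, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.9, 0.3])  # hypothetical action values for one state

# epsilon = 0 always follows the greedy action (index 1 here);
# epsilon = 1 ignores q_row and explores uniformly at random.
greedy_actions = [epsilon_greedy_action(q_row, 0.0, rng) for _ in range(20)]
random_actions = [epsilon_greedy_action(q_row, 1.0, rng) for _ in range(20)]
```

With a small `epsilon` the generated data concentrates on near-optimal actions, while a large `epsilon` spreads it across the action space, matching the two regimes described in the caption.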

Theorems & Definitions (12)

  • Lemma 1
  • Lemma 2
  • Theorem 1: Regret of the model-free algorithm
  • Theorem 2: Regret of the model-based algorithm
  • Corollary 1
  • Theorem 3: Regret of the model-based PPL
  • proof
  • Lemma 3
  • proof
  • proof
  • ...and 2 more