Semi-pessimistic Reinforcement Learning
Jin Zhu, Xin Zhou, Jiaang Yao, Gholamali Aminian, Omar Rivasplata, Simon Little, Lexin Li, Chengchun Shi
TL;DR
This work tackles offline RL under distributional shift in settings where labeled rewards are scarce. It introduces semi-pessimistic RL (SPL), which leverages abundant unlabeled data to construct a pessimistic lower bound on the reward function. The method combines semi-supervised uncertainty quantification with imputation of unlabeled rewards, yielding two algorithms (one model-free, one model-based) that operate under a new semi-coverage condition. Theoretical results provide finite-sample regret bounds that tighten as the amount of unlabeled data grows and as reward uncertainty decreases. Empirical validation covers synthetic tasks, MuJoCo benchmarks, and a semi-synthetic adaptive deep brain stimulation (DBS) application. The approach offers a practical, flexible framework for high-stakes domains where rewards are scarce but unlabeled transitions are plentiful, improving policy learning without the more restrictive conditions required by fully pessimistic methods.
Abstract
Offline reinforcement learning (RL) aims to learn an optimal policy from pre-collected data. However, it faces the challenge of distributional shift, where the learned policy may encounter scenarios not covered in the offline data. Additionally, numerous applications suffer from a scarcity of labeled reward data. Relying on labeled data alone often leads to a narrow state-action distribution, further amplifying the distributional shift and resulting in suboptimal policy learning. To address these issues, we first recognize that the volume of unlabeled data is typically substantially larger than that of labeled data. We then propose a semi-pessimistic RL method to effectively leverage abundant unlabeled data. Our approach offers several advantages. It considerably simplifies the learning process, as it seeks a lower bound of the reward function, rather than that of the Q-function or state transition function. It is highly flexible, and can be integrated with a range of model-free and model-based RL algorithms. It enjoys guaranteed improvement when utilizing vast unlabeled data, while requiring much less restrictive conditions. We compare our method with a number of alternative solutions, both analytically and numerically, and demonstrate its clear competitiveness. We further illustrate the method with an application to adaptive deep brain stimulation for Parkinson's disease.
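To make the idea of a reward-level lower bound concrete, the following is a minimal sketch, not the paper's algorithm. It assumes state-action pairs are given as feature vectors, uses a bootstrap ensemble of ridge regressions as a stand-in uncertainty quantifier, and imputes each unlabeled reward as the ensemble mean minus `beta` times the ensemble standard deviation. The function names, the ridge model, and the parameters `n_boot`, `beta`, and `lam` are all illustrative assumptions.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression weights (hypothetical reward model)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def pessimistic_rewards(X_lab, r_lab, X_unlab, n_boot=50, beta=1.0, seed=0):
    """Impute rewards for unlabeled transitions with a pessimistic penalty.

    X_lab, r_lab : features and observed rewards of labeled transitions
    X_unlab      : features of unlabeled transitions
    Returns (lower_bound, mean, std), each of shape (len(X_unlab),).
    """
    rng = np.random.default_rng(seed)
    n = X_lab.shape[0]
    preds = np.empty((n_boot, X_unlab.shape[0]))
    for b in range(n_boot):
        # Resample the labeled data with replacement and refit the model.
        idx = rng.integers(0, n, size=n)
        w = fit_ridge(X_lab[idx], r_lab[idx])
        preds[b] = X_unlab @ w
    mean = preds.mean(axis=0)
    std = preds.std(axis=0)
    # Pessimism is applied to the reward only, not to Q or the dynamics.
    return mean - beta * std, mean, std
```

The imputed lower-bound rewards could then be attached to the unlabeled transitions and passed, together with the labeled data, to any standard offline RL learner. In this sketch, that is the sense in which the pessimism is confined to the reward function.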
