Semi-supervised Batch Learning From Logged Data
Gholamali Aminian, Armin Behnamnia, Roberto Vega, Laura Toni, Chengchun Shi, Hamid R. Rabiee, Omar Rivasplata, Miguel R. D. Rodrigues
TL;DR
This work addresses off-policy learning from logged bandit data when feedback is missing for some samples. It derives a variance-based upper bound on the true risk under the inverse propensity score (IPS) estimator. The bound couples the learning policy with the logging policy through KL and reverse KL divergences, which motivates feedback-free KL regularization in a semi-supervised setting. The authors introduce two algorithms, WCE-S2BL and KL-S2BL, that use both known-feedback and missing-feedback data by combining truncated IPS with KL-based regularizers, and they report improved policy performance on multiple image datasets and the real-world KuaiRec dataset. The approach reduces propensity overfitting, enables effective learning when feedback is only partially observed, and comes with guidance on when KL or reverse KL regularization is advantageous. Potential extensions include propensity-score estimation, semi-supervised reinforcement learning, and integration with pessimistic offline RL concepts to further improve robustness.
Abstract
Off-policy learning methods are intended to learn a policy from logged data, which includes context, action, and feedback (cost or reward) for each sample point. In this work, we build on the counterfactual risk minimization framework, which also assumes access to propensity scores. We propose learning methods for problems where feedback is missing for some samples, so the logged data contain both samples with feedback and samples without it. We refer to this type of learning as semi-supervised batch learning from logged data, which arises in a wide range of application domains. We derive a novel upper bound for the true risk under the inverse propensity score estimator to address this kind of learning problem. Using this bound, we propose a regularized semi-supervised batch learning method with logged data where the regularization term is feedback-independent and, as a result, can be evaluated using the logged missing-feedback data. Consequently, even though feedback is only present for some samples, a policy can still be learned by leveraging the missing-feedback samples. Experiments on benchmark datasets indicate that these algorithms yield policies that outperform the logging policies.
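As a rough illustration of the idea described above, the following Python sketch combines a truncated IPS risk estimate on known-feedback samples with a feedback-independent divergence term estimated on missing-feedback samples. The function names, the truncation parameter `nu`, the regularization weight `lam`, and the sample-based divergence estimators are all assumptions made for illustration; they are not taken from the paper's implementation, and the actual WCE-S2BL and KL-S2BL objectives may differ.

```python
import numpy as np

def truncated_ips_risk(cost, pi_new, pi_log, nu=0.01):
    """Truncated IPS risk estimate on samples whose feedback is known.

    cost   : observed cost for each logged (context, action) pair
    pi_new : learning-policy probability of the logged action
    pi_log : logging-policy propensity score of the logged action
    nu     : hypothetical truncation floor applied to the propensity
    """
    cost = np.asarray(cost, dtype=float)
    pi_new = np.asarray(pi_new, dtype=float)
    pi_log = np.asarray(pi_log, dtype=float)
    weights = pi_new / np.maximum(pi_log, nu)
    return float(np.mean(cost * weights))

def kl_estimate(pi_new, pi_log, eps=1e-12):
    """Sample estimate of KL(pi_new || pi_log) from actions drawn by pi_log.

    Uses E_{a ~ pi_log}[w log w] with w = pi_new / pi_log. No feedback needed.
    """
    p = np.clip(np.asarray(pi_new, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(pi_log, dtype=float), eps, 1.0)
    w = p / q
    return float(np.mean(w * np.log(w)))

def reverse_kl_estimate(pi_new, pi_log, eps=1e-12):
    """Sample estimate of KL(pi_log || pi_new) from actions drawn by pi_log.

    Uses E_{a ~ pi_log}[log pi_log - log pi_new]. No feedback needed.
    """
    p = np.clip(np.asarray(pi_new, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(pi_log, dtype=float), eps, 1.0)
    return float(np.mean(np.log(q) - np.log(p)))

def regularized_objective(cost, pi_new_fb, pi_log_fb,
                          pi_new_mf, pi_log_mf,
                          lam=0.1, nu=0.01, reverse=False):
    """Truncated IPS risk on known-feedback samples (suffix _fb) plus a
    divergence penalty on missing-feedback samples (suffix _mf)."""
    risk = truncated_ips_risk(cost, pi_new_fb, pi_log_fb, nu=nu)
    divergence = reverse_kl_estimate if reverse else kl_estimate
    return risk + lam * divergence(pi_new_mf, pi_log_mf)
```

In this sketch only the first term touches the feedback, so the penalty can be computed on every logged sample that has a context, an action, and a propensity score. In practice `pi_new_fb` and `pi_new_mf` would come from a parameterized policy and the objective would be minimized with a gradient-based optimizer.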
