Semi-supervised Batch Learning From Logged Data

Gholamali Aminian, Armin Behnamnia, Roberto Vega, Laura Toni, Chengchun Shi, Hamid R. Rabiee, Omar Rivasplata, Miguel R. D. Rodrigues

TL;DR

This work addresses off-policy learning from logged bandit data when feedback is missing for some samples. It derives a variance-based upper bound on the IPS estimator that couples the learning policy with the logging policy through KL and reverse KL divergences, and uses this to motivate feedback-free KL-regularization in a semi-supervised setting. The authors introduce two algorithms, WCE-S2BL and KL-S2BL, that leverage both known-feedback and missing-feedback data via truncated IPS plus KL-based regularizers, and demonstrate improved policy performance across multiple image datasets and a real-world KuaiRec dataset. The approach reduces propensity overfitting and enables effective learning even when feedback is partially observed, with clear guidance on when KL or reverse KL regularization is advantageous. Potential extensions include propensity-score estimation, semi-supervised reinforcement learning, and integration with pessimistic offline RL concepts to further improve robustness.

Abstract

Off-policy learning methods are intended to learn a policy from logged data, which includes context, action, and feedback (cost or reward) for each sample point. In this work, we build on the counterfactual risk minimization framework, which also assumes access to propensity scores. We propose learning methods for problems where feedback is missing for some samples, so the logged data contain both known-feedback and missing-feedback samples. We refer to this type of learning as semi-supervised batch learning from logged data, which arises in a wide range of application domains. We derive a novel upper bound for the true risk under the inverse propensity score estimator to address this kind of learning problem. Using this bound, we propose a regularized semi-supervised batch learning method with logged data where the regularization term is feedback-independent and, as a result, can be evaluated using the logged missing-feedback data. Consequently, even though feedback is only present for some samples, a policy can still be learned by leveraging the missing-feedback samples. Experiments on benchmark datasets indicate that these algorithms learn policies that outperform the logging policies.
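The idea of combining a feedback-dependent risk term with a feedback-independent regularizer can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the truncation parameter `nu`, the weight `lam`, and the assumption that the full action distributions of both policies are available for the missing-feedback contexts are all hypothetical choices made for this example.

```python
import numpy as np

def truncated_ips_risk(cost, pi_theta, pi_0, nu):
    """Truncated IPS risk estimate on known-feedback samples.

    cost, pi_theta, pi_0: arrays of shape (n,), holding the observed cost and
    the probability each policy assigns to the logged action.
    nu: lower bound applied to the logging propensity (assumed form of truncation).
    """
    weights = pi_theta / np.maximum(pi_0, nu)
    return float(np.mean(weights * cost))

def kl_regularizer(pi_theta_all, pi_0_all, eps=1e-12):
    """Average KL(pi_theta(.|x) || pi_0(.|x)) over contexts.

    Both inputs have shape (m, num_actions). No feedback is needed, so this
    term can be evaluated on missing-feedback samples.
    """
    p = np.clip(pi_theta_all, eps, 1.0)
    q = np.clip(pi_0_all, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))

def s2bl_objective(cost, pi_theta_logged, pi_0_logged,
                   pi_theta_missing, pi_0_missing, nu=0.01, lam=1.0):
    """Hypothetical objective: risk on known-feedback data plus a
    KL penalty computed on missing-feedback data."""
    risk = truncated_ips_risk(cost, pi_theta_logged, pi_0_logged, nu)
    penalty = kl_regularizer(pi_theta_missing, pi_0_missing)
    return risk + lam * penalty
```

In this sketch the penalty vanishes when the learning policy equals the logging policy, so `lam` controls how far the learned policy may drift from the logged behaviour on contexts where no feedback was recorded.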

Paper Structure (37 sections, 10 theorems, 80 equations, 3 figures, 13 tables, 1 algorithm)

This paper contains 37 sections, 10 theorems, 80 equations, 3 figures, 13 tables, 1 algorithm.

Key Result

Proposition 4.1

Suppose that the importance-weighted squared cost function, i.e., $w(A,X)c^2(A,X)$, is $\sigma$-sub-Gaussian under $P_X\otimes \pi_0(A|X)$ and $P_X\otimes$ [...] where $b_l=\max(b_1,0)$ and $b_u=\max(|b_1|,b_2)$.

Footnote: A random variable $X$ is $\sigma$-sub-Gaussian if $E[e^{\gamma(X-E[X])}]\leq e^{\frac{\gamma^2 \sigma^2}{2}}$ for all $\gamma \in \mathbb{R}$.
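
As a standard illustration of the sub-Gaussian condition (a general fact, not a statement taken from the paper), Hoeffding's lemma says that any random variable $X$ with $a \leq X \leq b$ almost surely satisfies

$$E\left[e^{\gamma(X-E[X])}\right] \leq e^{\frac{\gamma^2 (b-a)^2}{8}} \quad \text{for all } \gamma \in \mathbb{R},$$

so $X$ is $\sigma$-sub-Gaussian with $\sigma = (b-a)/2$. For instance, if one assumes $0 \leq w(A,X)c^2(A,X) \leq M$ for some constant $M$ (a hypothetical bound introduced here for illustration), the condition holds with $\sigma = M/2$.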

Figures (3)

  • Figure 1: Accuracy of WCE-S2BL, KL-S2BL, WCE-S2BLK, KL-S2BLK, and B-CRM for $\tau=10$.
  • Figure 2: Accuracy of WCE-S2BL, KL-S2BL, WCE-S2BLK, KL-S2BLK, and BanditNet for $\tau=10$.
  • Figure 3: Accuracy of WCE-S2BL and KL-S2BL for different ratios of missing-feedback to known-feedback data samples. The number of known-feedback data samples is fixed at $1000$.

Theorems & Definitions (23)

  • Proposition 4.1
  • Corollary 4.2
  • Theorem 4.3
  • Proposition 4.4
  • Proposition 5.1
  • Lemma D.1
  • proof
  • proof
  • Remark E.1: Uniform Coverage (Overlap) Assumption
  • proof
  • ...and 13 more