PAC-Bayesian Reward-Certified Outcome Weighted Learning

Yuya Ishikawa, Shu Tamano

Abstract

Estimating optimal individualized treatment rules (ITRs) via outcome weighted learning (OWL) often relies on observed rewards that are noisy or optimistic proxies for the true latent utility. Ignoring this reward uncertainty leads to the selection of policies with inflated apparent performance, yet existing OWL frameworks lack the finite-sample guarantees required to systematically embed such uncertainty into the learning objective. To address this issue, we propose PAC-Bayesian Reward-Certified Outcome Weighted Learning (PROWL). Given a one-sided uncertainty certificate, PROWL constructs a conservative reward and a strictly policy-dependent lower bound on the true expected value. Theoretically, we prove an exact certified reduction that transforms robust policy learning into a unified, split-free cost-sensitive classification task. This formulation enables the derivation of a nonasymptotic PAC-Bayes lower bound for randomized ITRs, where we establish that the optimal posterior maximizing this bound is exactly characterized by a general Bayes update. To overcome the learning-rate selection problem inherent in generalized Bayesian inference, we introduce a fully automated, bounds-based calibration procedure, coupled with a Fisher-consistent certified hinge surrogate for efficient optimization. Our experiments demonstrate that PROWL improves the estimation of robust, high-value treatment regimes under severe reward uncertainty relative to standard ITR estimation methods.
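To make the pipeline concrete, the following is a minimal sketch of the certified OWL step described above: it assumes a given one-sided certificate width $c(X)$ (so the conservative reward is $R - c(X)$) and fits a linear rule with a weighted hinge surrogate. The names (`fit_certified_owl`, `certificate`) are illustrative, and this is a sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

def fit_certified_owl(X, A, R, pi, certificate, lr=0.05, epochs=500):
    """Minimal certified-OWL sketch (illustrative, not the paper's estimator).

    X: (n, p) covariates; A: actions in {-1, +1}; R: observed rewards;
    pi: behavior propensity of the observed action; certificate: one-sided
    uncertainty width c(X) >= 0, so the conservative reward is R - c(X).
    Trains a linear rule by subgradient descent on the weighted hinge
    surrogate of the certified 0-1 risk.
    """
    R_lower = R - certificate              # conservative reward
    alpha = min(R_lower.min(), 0.0)        # shift keeps hinge weights >= 0
    w = (R_lower - alpha) / pi             # cost-sensitive weights
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        margin = A * (Xb @ theta)
        active = margin < 1.0              # hinge is linear on these points
        grad = -(w[active, None] * A[active, None] * Xb[active]).sum(0) / len(Xb)
        theta -= lr * grad
    return lambda Xn: np.where(np.hstack([Xn, np.ones((len(Xn), 1))]) @ theta >= 0, 1, -1)
```

With `certificate = 0` the sketch reduces to plain OWL; any nonnegative width shrinks the reward before weighting, which is the conservatism the certificate encodes.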

Paper Structure

This paper contains 60 sections, 18 theorems, 222 equations, 16 figures, and 3 tables.

Key Result

Theorem 1

For every measurable ITR $d$ and every $\alpha\in\mathcal{A}$, the certified value satisfies $V_{\underline{R}}(d) = C_\alpha - \mathcal{R}_{01}^\alpha(d)$ for a constant $C_\alpha$ that does not depend on $d$. Consequently, for each fixed $\alpha$, maximizing $V_{\underline{R}}(d)$ over measurable $d$ is strictly equivalent to minimizing $\mathcal{R}_{01}^\alpha(d)$. Moreover, any Bayes rule for $V_{\underline{R}}$ is a sign rule in the conditional contrast of the conservative reward between treatments.
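The reduction can be sanity-checked numerically. The sketch below assumes the standard inverse-propensity forms $V_{\underline{R}}(d)=\mathbb{E}[\underline{R}\,\mathbb{1}\{A=d(X)\}/\pi(A\mid X)]$ and $\mathcal{R}_{01}^\alpha(d)=\mathbb{E}[(\underline{R}-\alpha)\,\mathbb{1}\{A\neq d(X)\}/\pi(A\mid X)]$, which may differ in detail from the paper's definitions; under them, $V_{\underline{R}}(d)+\mathcal{R}_{01}^\alpha(d)$ equals the policy-independent constant $C_\alpha=\mathbb{E}[(\underline{R}-\alpha)/\pi(A\mid X)]+\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic logged data with known behavior propensities.
X = rng.normal(size=n)
p1 = 1.0 / (1.0 + np.exp(-X))            # P(A = +1 | X)
A = np.where(rng.random(n) < p1, 1, -1)
pi = np.where(A == 1, p1, 1.0 - p1)      # propensity of the observed action

# Conservative reward: observed reward minus an illustrative certificate width.
R_lower = X * A + rng.normal(scale=0.5, size=n) - 0.3
alpha = R_lower.min()                    # any shift making the weights nonnegative

C_alpha = np.mean((R_lower - alpha) / pi) + alpha   # policy-independent constant

for name, d in [("sign(x)", np.sign(X)), ("-sign(x)", -np.sign(X)),
                ("always +1", np.ones(n))]:
    V = np.mean(R_lower * (A == d) / pi)               # certified value estimate
    risk = np.mean((R_lower - alpha) * (A != d) / pi)  # weighted 0-1 risk estimate
    print(f"{name:>9}: V + risk = {V + risk:.3f}  (C_alpha = {C_alpha:.3f})")
```

The printed sums agree with $C_\alpha$ up to Monte Carlo error for every candidate policy, so ranking policies by certified value or by cost-sensitive risk gives the same ordering.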

Figures (16)

  • Figure 1: Performance comparison across varying uncertainty levels ($\rho$) for Scenario 1. The left panel reports target regret against the $R$-family baselines, and the right panel reports robust regret against the $\underline{R}$-family baselines.
  • Figure 2: Performance comparison across varying uncertainty levels ($\rho$) for Scenario 2. The left panel reports target regret against the $R$-family baselines, and the right panel reports robust regret against the $\underline{R}$-family baselines.
  • Figure 3: Performance comparison across varying sample sizes ($N$) for Scenario 1. The left panel reports target regret against the $R$-family baselines, and the right panel reports robust regret against the $\underline{R}$-family baselines.
  • Figure 4: Performance comparison across varying sample sizes ($N$) for Scenario 2. The left panel reports target regret against the $R$-family baselines, and the right panel reports robust regret against the $\underline{R}$-family baselines.
  • Figure 5: Point-range summary of the primary actual-data comparison on ELAIA-1. Left panel reports the estimated certified value $\hat{V}_{\underline{R}}$ with $95\%$ confidence intervals across the $30$ repeated sample splits. Right panel reports the estimated composite-free value $\hat{V}_{\mathrm{comp}}$ with the same uncertainty summary. The figure complements Table \ref{tab:actual-policy-comparison} by emphasizing method ordering on the certified objective and on the trial's primary hard-outcome anchor.
  • ...and 11 more figures

Theorems & Definitions (37)

  • Theorem 1: Exact certified reduction
  • Proposition 2: Conditional moments and variance-optimal nuisance choices
  • Theorem 3: PAC-Bayes lower bound and exact general Bayes update (see the Gibbs-update sketch after this list)
  • Proposition 4: Exact-value certification and temperature selection
  • Proposition 5: Fisher consistency and excess-risk domination of the certified hinge loss
  • Corollary 6: Posterior-family exact-value selection
  • Corollary 7: Learned certificates via auxiliary calibration
  • Proposition 8: Inverse-propensity representations
  • Proposition 9: Certified lower-value domination
  • Remark 1: When the certificate does not alter policy ranking
  • ...and 27 more
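
As referenced at Theorem 3 above, the bound-optimal posterior is a general Bayes (Gibbs) update. The sketch below shows the standard exponential-reweighting form on a finite policy class, with `lam` standing in for the temperature that the paper's calibration procedure selects automatically; the paper's exact objective (a certified value bound rather than a raw empirical risk) may differ in detail.

```python
import numpy as np

def gibbs_posterior(prior, emp_risks, lam):
    """General Bayes update over a finite policy class (illustrative):
    posterior(h) is proportional to prior(h) * exp(-lam * emp_risk(h))."""
    logw = np.log(prior) - lam * np.asarray(emp_risks)
    logw -= logw.max()                 # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()

# Toy class of three candidate rules with hypothetical certified risks.
prior = np.full(3, 1 / 3)
emp_risks = [0.42, 0.35, 0.61]
for lam in (0.0, 2.0, 20.0):
    print(lam, gibbs_posterior(prior, emp_risks, lam).round(3))
```

At `lam = 0` the posterior equals the prior; as `lam` grows it concentrates on the empirical-risk minimizer, which is exactly the prior-versus-data trade-off that the learning-rate calibration navigates.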