Optimal Policy Adaptation under Covariate Shift

Xueqing Liu, Qinwei Yang, Zhaoqing Tian, Ruocheng Guo, Peng Wu

TL;DR

This work addresses learning an optimal policy for a target domain in which only covariates are observed, using a fully labeled source dataset and a covariate-shift assumption. It sets up a causal identification framework under unconfoundedness and transportability, derives the efficient influence function, and proposes a semiparametrically efficient (SE), doubly robust estimator of the target reward $R(\pi)$. The policy is learned by maximizing the SE estimate, with theoretical guarantees of consistency, asymptotic normality, and a generalization error bound; the approach also extends to maximizing $V(\pi)$ over the entire domain. On simulated and real-world data, the SE method yields more accurate reward estimates and substantially better policies than the Direct and IPW baselines, which supports its practical use under covariate shift.

Abstract

Transfer learning of prediction models has been extensively studied, while the corresponding policy learning approaches are rarely discussed. In this paper, we propose principled approaches for learning the optimal policy in the target domain by leveraging two datasets: one with full information from the source domain and the other from the target domain with only covariates. First, under the setting of covariate shift, we formulate the problem from a perspective of causality and present the identifiability assumptions for the reward induced by a given policy. Then, we derive the efficient influence function and the semiparametric efficiency bound for the reward. Based on this, we construct a doubly robust and semiparametric efficient estimator for the reward and then learn the optimal policy by optimizing the estimated reward. Moreover, we theoretically analyze the bias and the generalization error bound for the learned policy. Extensive experiments demonstrate that the approach not only estimates the reward more accurately but also yields a policy that closely approximates the theoretically optimal policy.
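
To make the idea of a doubly robust reward estimate under covariate shift concrete, the following is a minimal sketch and not the paper's exact estimator. It assumes a binary treatment and uses the standard transported augmented-IPW form: an outcome-model term averaged over the target covariates, plus a residual correction on the source data weighted by the density ratio $(1 - s(x))/s(x)$. All names are hypothetical: `mu1`/`mu0` are fitted outcome regressions, `e_src` is the source propensity score, and `s_src` is the estimated probability that a unit belongs to the source domain. The nuisance estimates are assumed to be fitted elsewhere and passed in as arrays.

```python
import numpy as np


def dr_target_reward(pi_src, pi_tgt, a, y,
                     mu1_src, mu0_src, mu1_tgt, mu0_tgt,
                     e_src, s_src, eps=1e-6):
    """Doubly robust estimate of the target-domain reward of a policy.

    Hypothetical helper (assumed form, binary treatment):
      pi_src, pi_tgt : policy's probability of treating, on source / target units
      a, y           : observed treatment (0/1) and outcome on source units
      mu1_*, mu0_*   : outcome-model predictions under treatment / control
      e_src          : propensity score P(A=1 | X) on source units
      s_src          : P(unit is from the source domain | X) on source units
    """
    pi_src, pi_tgt = np.asarray(pi_src, float), np.asarray(pi_tgt, float)
    a, y = np.asarray(a, float), np.asarray(y, float)
    mu1_src, mu0_src = np.asarray(mu1_src, float), np.asarray(mu0_src, float)
    mu1_tgt, mu0_tgt = np.asarray(mu1_tgt, float), np.asarray(mu0_tgt, float)
    # Clip estimated probabilities away from 0 and 1 for numerical stability.
    e = np.clip(np.asarray(e_src, float), eps, 1 - eps)
    s = np.clip(np.asarray(s_src, float), eps, 1 - eps)

    n_tgt = len(pi_tgt)

    # Outcome-model ("direct") term, evaluated on the target covariates.
    direct = pi_tgt * mu1_tgt + (1 - pi_tgt) * mu0_tgt

    # Density-ratio weight that moves source residuals to the target population.
    w = (1 - s) / s

    # Inverse-propensity-weighted residuals on the labeled source data.
    resid = (pi_src * a * (y - mu1_src) / e
             + (1 - pi_src) * (1 - a) * (y - mu0_src) / (1 - e))

    # Both sums are normalized by the target sample size.
    return (direct.sum() + (w * resid).sum()) / n_tgt
```

In this form the correction term vanishes when the outcome models fit the source outcomes exactly, and it compensates for outcome-model error when the propensity and domain-membership models are correct; that is the usual sense of double robustness. A policy could then be learned by maximizing this estimate over a chosen policy class.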

Paper Structure

This paper contains 26 sections, 7 theorems, 25 equations, 2 figures, 1 table, and 1 algorithm.

Key Result

Lemma 1

The oracle policy where $\max_{\pi}$ is taken over all possible policies without constraints, rather than being restricted to $\Pi$.

Figures (2)

  • Figure 1: Comparison of three methods with different means of covariates in the target dataset
  • Figure 2: Comparison of three methods with different treatments in the target dataset

Theorems & Definitions (7)

  • Lemma 1
  • Theorem 1: Efficiency Bound of $R(\pi)$
  • Proposition 1: Double Robustness of $\hat{R}_{\text{SE}}(\pi)$
  • Theorem 2: Efficiency of $\hat{R}_{\text{SE}}(\pi)$
  • Proposition 2: Bias
  • Theorem 3: Generalization Error Bound
  • Theorem 4: Efficiency Bound of $V(\pi)$