Transfer Learning of CATE with Kernel Ridge Regression

Seok-Jin Kim, Hongjie Liu, Molei Liu, Kaizheng Wang

TL;DR

This work proposes an overlap-adaptive method for transfer learning of the conditional average treatment effect (CATE) using kernel ridge regression (KRR). It justifies the method theoretically through sharp non-asymptotic MSE bounds that show its adaptivity to both weak overlap and the complexity of the CATE function.

Abstract

The proliferation of data has sparked significant interest in leveraging findings from one study to estimate treatment effects in a different target population without direct outcome observations. However, the transfer learning process is frequently hindered by substantial covariate shift and limited overlap between (i) the source and target populations, as well as (ii) the treatment and control groups within the source. We propose a novel method for overlap-adaptive transfer learning of the conditional average treatment effect (CATE) using kernel ridge regression (KRR). Our approach involves partitioning the labeled source data into two subsets. The first subset is used to train candidate CATE models based on regression adjustment and pseudo-outcomes. An optimal model is then selected using the second subset and unlabeled target data, employing another pseudo-outcome-based strategy. We provide a theoretical justification for our method through sharp non-asymptotic MSE bounds, highlighting its adaptivity to both weak overlap and the complexity of the CATE function. Extensive numerical studies confirm that our method achieves superior finite-sample efficiency and adaptability. We conclude by demonstrating the effectiveness of our approach using a 401(k) eligibility dataset.
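The split, train, and select workflow described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not spell out the pseudo-outcome constructions, so the sketch assumes plain regression-adjustment candidates (a KRR fit per treatment arm, differenced) and a simple selection rule that compares each candidate to pilot effect estimates imputed on the unlabeled target covariates. The function names (`krr_fit`, `ra_cate`, `select_cate`), the RBF kernel, and the pilot penalty `lam_pilot` are all hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam, gamma=1.0):
    # Kernel ridge regression: solve (K + n * lam * I) alpha = y.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return lambda Z: rbf_kernel(Z, X, gamma) @ alpha

def ra_cate(X, A, Y, lam):
    # Regression adjustment: fit one outcome model per arm, then difference.
    mu1 = krr_fit(X[A == 1], Y[A == 1], lam)
    mu0 = krr_fit(X[A == 0], Y[A == 0], lam)
    return lambda Z: mu1(Z) - mu0(Z)

def select_cate(X, A, Y, X_target, lams, lam_pilot=1e-3, seed=0):
    # Step 1: split the labeled source data into two halves.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    half = len(Y) // 2
    i1, i2 = idx[:half], idx[half:]
    # Step 2: train one candidate CATE model per penalty on the first half.
    candidates = [ra_cate(X[i1], A[i1], Y[i1], lam) for lam in lams]
    # Step 3 (assumed rule, not the paper's): impute pilot effects on the
    # unlabeled target covariates using a model fit on the second half.
    pilot = ra_cate(X[i2], A[i2], Y[i2], lam_pilot)
    pseudo = pilot(X_target)
    # Step 4: keep the candidate closest to the imputed effects on the target.
    scores = [float(np.mean((c(X_target) - pseudo) ** 2)) for c in candidates]
    best = int(np.argmin(scores))
    return candidates[best], lams[best], scores
```

Evaluating the selection criterion on the target covariates, rather than on held-out source points, is what makes the choice of penalty sensitive to covariate shift in this sketch.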

Paper Structure

This paper contains 61 sections, 31 theorems, 197 equations, 3 figures, 3 tables, 3 algorithms.

Key Result

Theorem 1

Suppose that we run COKE under the following assumptions: boundedness; consistency and unconfoundedness; sub-Gaussian noise; source-target overlap; weak treatment overlap; and eigenvalue decay. We further assume that $n > BR$ and that $\| h^\star\|_{\mathcal{\dots}} \dots$ [...] Here, $\lesssim$ hides absolute constants, $\sigma$, $\xi$, and logarithmic factors.

Figures (3)

  • Figure 1: Performance of COKE, ACW-CATE, DR-CATE and SR across varying simulation settings. Panels show the average MSE as a function of: (i) $S_B$ (degree of covariate shift between source and target) for $q = 1$, (ii) $S_R$ (degree of shift between treatment and control groups), (iii) $c$ (complexity of outcome models relative to the CATE), (iv) $S_B$ for $q = 2$ (weak overlap on two-dimensional covariates), and (v) $n_{{\mathcal{T}}}=n/4$.
  • Figure 2: Empirical distribution of the logarithm of the density ratio between the source and target populations, estimated by logistic regression.
  • Figure A1: Comparison of mean squared error for COKE with and without cross-fitting under the default setting with $q = 1$ across varying $S_B$ (degree of covariate shift between source and target). Cross-fitting reduces estimation error consistently by approximately $13.5\%$--$15\%$.
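The logistic-regression density-ratio estimate mentioned in the Figure 2 caption can be sketched as follows. This is a generic probabilistic-classification construction, not code from the paper: samples are labeled by population, a logistic model is fit for the probability of belonging to the target, and the log density ratio is the fitted logit corrected for the sample-size imbalance. The direction of the ratio (target over source), the gradient-descent fitting routine, and the names `fit_logistic` and `log_density_ratio` are assumptions made for illustration.

```python
import numpy as np

def fit_logistic(X, y, n_iter=500, lr=0.5, l2=1e-3):
    # Logistic regression by gradient descent, with a small ridge penalty
    # applied to all coefficients (including the intercept, for simplicity).
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        grad = Xb.T @ (p - y) / len(y) + l2 * w
        w -= lr * grad
    return w

def log_density_ratio(X_source, X_target, X_eval):
    # Label source points 0 and target points 1, then fit P(target | x).
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    w = fit_logistic(X, y)
    logit = np.hstack([X_eval, np.ones((len(X_eval), 1))]) @ w
    # By Bayes' rule, log(p_T / p_S) = logit(P(target | x)) + log(n_S / n_T).
    return logit + np.log(len(X_source) / len(X_target))
```

A histogram of `log_density_ratio(X_source, X_target, X_source)` would give a plot of the kind the caption describes; heavy tails in that histogram indicate weak overlap between the two populations.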

Theorems & Definitions (41)

  • Remark 1
  • Example 1: $R$: Bounded propensity score
  • Example 2: $B$: Bounded source-target density ratio
  • Example 3: $R$: Singular propensity score
  • Example 4: $B$: Dirac target distribution
  • Definition 1: Effective sample size
  • Theorem 1: MSE bound of final model
  • Theorem 2: MSE bound of RA learner estimators
  • Corollary 1: Optimal MSE bound among candidates
  • Proposition 1: Oracle inequality for in-sample MSE
  • ...and 31 more