Stratified Learning: A General-Purpose Statistical Method for Improved Learning under Covariate Shift

Maximilian Autenrieth, David A. van Dyk, Roberto Trotta, David C. Stenning

TL;DR

StratLearn addresses covariate shift by conditioning on propensity scores $e(x)$ and partitioning the data into $k$ strata, so that the source data within each stratum approximate the target distribution and target risk can be minimized without heavy weighting. The authors prove that $p_T(x,y \mid e(x)) = p_S(x,y \mid e(x))$, which holds approximately within strata, and demonstrate strong empirical gains over state-of-the-art importance weighting in cosmology tasks, including SNIa classification with an AUC of $0.958$ on the updated SPCC data and improved photo-$z$ density estimation on SDSS data. The method is general-purpose, scalable to high-dimensional covariates, and supported by balance diagnostics (SMD and KS) and diagnostic use of predicted outcomes. StratLearn offers a robust alternative to weighting, with broad applicability beyond astronomy, and it integrates causal-inference balance diagnostics into domain adaptation. Overall, the paper provides both theoretical guarantees and practical evidence that propensity-score stratification can effectively neutralize covariate shift and enhance predictive performance across diverse supervised learning tasks.

Abstract

We propose a simple, statistically principled, and theoretically justified method to improve supervised learning when the training set is not representative, a situation known as covariate shift. We build upon a well-established methodology in causal inference, and show that the effects of covariate shift can be reduced or eliminated by conditioning on propensity scores. In practice, this is achieved by fitting learners within strata constructed by partitioning the data based on the estimated propensity scores, leading to approximately balanced covariates and much-improved target prediction. We demonstrate the effectiveness of our general-purpose method on two contemporary research questions in cosmology, outperforming state-of-the-art importance weighting methods. We obtain the best reported AUC (0.958) on the updated "Supernovae photometric classification challenge", and we improve upon existing conditional density estimation of galaxy redshift from Sloan Digital Sky Survey (SDSS) data.
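The stratification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of $k=5$ quantile-based strata, the plain gradient-descent logistic regression used to estimate $e(x)$, the least-squares learner fitted within each stratum, and the fallback for sparsely populated strata are all assumptions made for illustration.

```python
import numpy as np


def fit_logistic(X, s, lr=0.1, n_iter=2000):
    """Gradient-descent logistic regression (illustrative propensity model)."""
    Xb = np.column_stack([np.ones(len(X)), X])  # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - s) / len(s)  # gradient of the mean log-loss
    return w


def propensity(X, w):
    """Estimated e(x) = P(source | x) under the fitted logistic model."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ w))


def stratified_predict(X_src, y_src, X_tgt, k=5):
    """Fit one learner per propensity-score stratum and predict the target."""
    n_src = len(X_src)

    # 1. Estimate propensity scores on the pooled source and target data.
    X_all = np.vstack([X_src, X_tgt])
    s = np.concatenate([np.ones(n_src), np.zeros(len(X_tgt))])
    e_all = propensity(X_all, fit_logistic(X_all, s))

    # 2. Build k strata from quantiles of the pooled scores (assumed scheme).
    edges = np.quantile(e_all, np.linspace(0, 1, k + 1))[1:-1]
    strata_src = np.searchsorted(edges, e_all[:n_src])  # labels 0..k-1
    strata_tgt = np.searchsorted(edges, e_all[n_src:])

    # 3. Within each stratum, fit on source data and predict target data.
    y_hat = np.full(len(X_tgt), np.nan)
    for j in range(k):
        src_j = strata_src == j
        tgt_j = strata_tgt == j
        if not tgt_j.any():
            continue
        if src_j.sum() < X_src.shape[1] + 1:
            # Hypothetical fallback: too few source points, use all of them.
            src_j = np.ones(n_src, dtype=bool)
        A = np.column_stack([np.ones(src_j.sum()), X_src[src_j]])
        coef, *_ = np.linalg.lstsq(A, y_src[src_j], rcond=None)
        B = np.column_stack([np.ones(tgt_j.sum()), X_tgt[tgt_j]])
        y_hat[tgt_j] = B @ coef
    return y_hat, strata_src, strata_tgt
```

Calling `stratified_predict(X_src, y_src, X_tgt)` returns target predictions along with the stratum label of every source and target point, so the strata can be inspected for covariate balance before the predictions are used.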

Paper Structure

This paper contains 46 sections, 1 theorem, 19 equations, 8 figures, 12 tables.

Key Result

Proposition 1

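If $p_S(x,y)$ and $p_T(x,y)$ satisfy the covariate shift definition and $0<e(x)<1$, then it holds that

$$p_T(x,y \mid e(x)) = p_S(x,y \mid e(x)).$$

That is, conditional on $e(x)$ the joint source and target distributions are the same, eliminating covariate shift. It follows, for any loss function $\ell = \ell (f(x),y)$,

$$\mathbb{E}_{p_T(x,y \mid e(x))}\left[\ell(f(x),y)\right] = \mathbb{E}_{p_S(x,y \mid e(x))}\left[\ell(f(x),y)\right].$$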
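One way to see why the identity $p_T(x,y \mid e(x)) = p_S(x,y \mid e(x))$ holds is through the balancing property of the propensity score known from causal inference. The following is a sketch rather than the authors' proof. It assumes notation not given above: a source indicator $s$ with $e(x) = P(s=1 \mid x)$, $p_S(\cdot) = p(\cdot \mid s=1)$, and $p_T(\cdot) = p(\cdot \mid s=0)$. The condition $0<e(x)<1$ ensures that both conditional distributions are well defined.

```latex
\begin{align*}
p_T(x, y \mid e(x))
  &= p_T(y \mid x, e(x)) \, p_T(x \mid e(x)) \\
  &= p_T(y \mid x) \, p_T(x \mid e(x))
     && \text{since $e(x)$ is a function of $x$} \\
  &= p_S(y \mid x) \, p_T(x \mid e(x))
     && \text{covariate shift: } p_T(y \mid x) = p_S(y \mid x) \\
  &= p_S(y \mid x) \, p_S(x \mid e(x))
     && \text{balancing property: } x \perp s \mid e(x) \\
  &= p_S(x, y \mid e(x)).
\end{align*}
```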

Figures (8)

  • Figure 1: StratLearn flow chart. (*Covariate balance and outcome balance are assessed as described in the section on balance diagnostics, with a numerical example given in the section on SPCC classification.)
  • Figure 2: Example of photometric LC data, including $1\sigma$ error bars, for a typical SNIa (specifically, SN2475 from the updated simulated SPCC data of Kessler et al., 2010).
  • Figure 3: Comparison of ROC curves for SNIa classification using the updated SPCC data. Here, Biased and uLSIF are identical. Bootstrap AUC standard errors (from 400 bootstrap samples) are given in parentheses.
  • Figure 4: Absolute standardized mean differences between source and target data of stratum 1 plotted against "raw" data absolute standardized mean differences for StratLearn and STACCATO.
  • Figure 5: Target risk ($\hat{R}_T$) of the four photo-$z$ estimation models under each method (different colors), using different sets of predictors. Bars give the mean $\pm$ 2 bootstrap standard errors (from $400$ bootstrap samples).
  • ...and 3 more figures
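The balance diagnostics behind Figure 4 can be sketched as follows. The definitions here are common textbook forms and are an assumption: the absolute standardized mean difference uses the average of the two sample variances in the denominator, and the two-sample Kolmogorov-Smirnov statistic is the largest gap between empirical CDFs. The paper's exact variants may differ, and the function names are hypothetical.

```python
import numpy as np


def abs_smd(X_src, X_tgt):
    """Absolute standardized mean difference per covariate (column)."""
    mean_diff = X_src.mean(axis=0) - X_tgt.mean(axis=0)
    # Assumed denominator: square root of the average of the two variances.
    pooled_sd = np.sqrt(
        (X_src.var(axis=0, ddof=1) + X_tgt.var(axis=0, ddof=1)) / 2.0
    )
    return np.abs(mean_diff) / pooled_sd


def ks_statistic(a, b):
    """Two-sample KS statistic: maximum distance between empirical CDFs."""
    a = np.sort(a)
    b = np.sort(b)
    grid = np.concatenate([a, b])  # the maximum is attained at a sample point
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))
```

Applied to the source and target rows of a single stratum, values of both diagnostics closer to zero than on the raw data would indicate that stratification has improved covariate balance.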

Theorems & Definitions (2)

  • Proposition 1: Learning conditional on the propensity score
  • Remark 1