Transporting Predictions via Double Machine Learning: Predicting Partially Unobserved Students' Outcomes

Falco J. Bargagli-Stoffi, Emma Landry, Kevin P. Josey, Kenneth De Beckker, Joana E. Maldonado, Kristof De Witte

Abstract

Educational policymakers often lack data on student outcomes where standardized tests were not administered. Machine learning can predict unobserved outcomes in a target population using data from a source population. However, differences in covariate distributions between the two populations reduce model transportability, potentially decreasing predictive accuracy and introducing bias. We propose using double machine learning for covariate-shift weighted models. First, we estimate overlap scores, that is, the probability that an observation belongs to the source dataset given its covariates. Second, balancing weights, defined as ratios of target-to-source membership probabilities, reweight each observation's contribution to the loss function of the model that predicts target outcomes. This downweights source observations that are less similar to the target population, so that predictions rely more on observations with greater overlap. Consequently, predictions become more transportable under covariate shift. We illustrate this framework using data on students' standardized financial literacy scores (FLS). Using Bayesian Additive Regression Trees (BART), we predict missing FLS. We find minimal differences in predictive performance between weighted and unweighted models, suggesting limited covariate shift in our setting. Nonetheless, our approach provides a principled framework for addressing covariate shift and is broadly applicable to predictive modeling in the social and health sciences, where source-target population differences are common.
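
The two-step procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the overlap score is fitted with a hand-written logistic regression, the weights are taken to be (1 - e(x)) / e(x) with e(x) the estimated probability of source membership (one reading of "target-to-source membership probabilities"), and a weighted least-squares line stands in for BART. All function and variable names are hypothetical.

```python
import math
import random

random.seed(0)

# Synthetic data (an assumption for illustration): one covariate whose
# distribution is shifted between the source and target populations.
n = 500
x_source = [random.gauss(0.0, 1.0) for _ in range(n)]
x_target = [random.gauss(1.0, 1.0) for _ in range(n)]
y_source = [x * x + random.gauss(0.0, 0.1) for x in x_source]
# Target outcomes are unobserved in practice; simulated here only for evaluation.
y_target = [x * x + random.gauss(0.0, 0.1) for x in x_target]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_overlap_model(x, s, lr=0.5, n_iter=500):
    """Logistic regression of the source indicator s on x, by gradient descent."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, si in zip(x, s):
            err = sigmoid(b0 + b1 * xi) - si
            g0 += err
            g1 += err * xi
        b0 -= lr * g0 / len(x)
        b1 -= lr * g1 / len(x)
    return b0, b1


def weighted_linear_fit(x, y, w):
    """Minimise sum_i w_i * (y_i - a - b * x_i)^2 in closed form; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def rmse(params, x, y):
    a, b = params
    return math.sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / len(x))


# Step 1: overlap scores e(x) = P(source | x), fitted on the pooled covariates.
b0, b1 = fit_overlap_model(x_source + x_target, [1] * n + [0] * n)
overlap = [sigmoid(b0 + b1 * x) for x in x_source]

# Step 2: balancing weights for the source observations. Both samples have the
# same size here, so no extra scaling constant is applied (a constant factor
# would not change the weighted fit anyway).
weights = [(1.0 - e) / e for e in overlap]

# Step 3: weighted and unweighted outcome models, both trained on source data only.
fit_weighted = weighted_linear_fit(x_source, y_source, weights)
fit_unweighted = weighted_linear_fit(x_source, y_source, [1.0] * n)

rmse_weighted = rmse(fit_weighted, x_target, y_target)
rmse_unweighted = rmse(fit_unweighted, x_target, y_target)
print(f"target RMSE  weighted: {rmse_weighted:.3f}  unweighted: {rmse_unweighted:.3f}")
```

The synthetic setting is deliberately built with a pronounced covariate shift and a misspecified linear outcome model, so the weighting has a visible effect. The abstract reports the opposite situation for the FLS data, where weighted and unweighted predictions barely differ.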

Paper Structure

This paper contains 31 sections, 8 equations, 23 figures, 9 tables.

Figures (23)

  • Figure 1: Boxplot of RMSE values across 40 simulated datasets. Each plot corresponds to one of the three data generation scenarios considered. Results with the standard BART model are shown in blue, those with a Random Forest model in pink, and those with the weighted BART model in orange.
  • Figure 2: (Left) Distribution of PISA math scores in Flanders (blue) and Wallonia (yellow). (Right) Distribution of the PISA Wealth Index in Flanders (blue) and Wallonia (yellow).
  • Figure 3: Predicted FLS for Flanders (light blue) and Wallonia (orange). The red line indicates the threshold for the baseline level of proficiency in financial literacy. The OECD suggests that students scoring above this 400-point threshold have financial literacy levels sufficient to participate in society (OECD, 2017a).
  • Figure 4: Predicted FLS for Flanders (light blue) and Wallonia (orange) under the weighted BART model. The red line indicates the threshold of the baseline level of proficiency in financial literacy.
  • Figure 5: Comparison of predicted FLS under the weighted and unweighted BART models. The dotted line is the identity line: observations cluster around it, indicating minimal discrepancy between the two models' predictions.
  • ...and 18 more figures