Temporal and Between-Group Variability in College Dropout Prediction

Dominik Glandorf, Hye Rin Lee, Gabe Avakian Orona, Marina Pumptow, Renzhe Yu, Christian Fischer

TL;DR

The paper tackles college dropout prediction using large-scale administrative data, addressing how predictive performance and predictor importance evolve over time and across student groups. It compares multiple machine learning models and demonstrates that Random Forest generally achieves the best performance, with AUROC and AUPRC improving as predictions are made later in the student trajectory. Early predictors (demographics and pre-entry factors) give way to college-performance and enrollment-behavior indicators, with GPA remaining particularly important for historically disadvantaged groups. The study provides actionable insights for designing time-aware, subgroup-sensitive early warning systems in higher education, emphasizing cost-effective data use and the careful interpretation of predictor importance across different policy goals.

Abstract

Large-scale administrative data is a common input in early warning systems for college dropout in higher education. Still, the terminology and methodology vary significantly across existing studies, and the implications of different modeling decisions are not fully understood. This study provides a systematic evaluation of contributing factors and predictive performance of machine learning models over time and across different student groups. Drawing on twelve years of administrative data at a large public university in the US, we find that dropout prediction at the end of the second year has a 20% higher AUC than at the time of enrollment in a Random Forest model. Also, most predictive factors at the time of enrollment, including demographics and high school performance, are quickly superseded in predictive importance by college performance and in later stages by enrollment behavior. Regarding variability across student groups, college GPA has more predictive value for students from traditionally disadvantaged backgrounds than for their peers. These results can help researchers and administrators understand the comparative value of different data sources when building early warning systems and optimizing decisions under specific policy goals.

Paper Structure (28 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Performance metrics for different time points of predictions against their respective baselines. PRC: precision-recall curve, ROC: receiver operating characteristic curve. For curve-based metrics, the baseline is a random prediction; for accuracy, it is based on the best possible threshold.
  • Figure 2: Predictor importance for different time points of predictions. Predictors whose importance is always below 2.5% are omitted. Because importance is root-transformed for readability, differences in the region below 5% may appear exaggerated.
  • Figure 3: Model performance and population sizes by grouping factors. Both are based on the data available one year after initial enrollment.
  • Figure 4: Differences in predictor importance between groups when predicting dropout one year after initial enrollment. Twenty-nine predictors with a maximum score of 1% or below in every group are omitted from this plot.
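
The baselines mentioned in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' code, and it assumes one reading of the caption: a random prediction yields an AUROC of about 0.5 and an area under the precision-recall curve of about the dropout rate, while the accuracy baseline is taken to be the majority-class rate. The function names (`auroc`, `average_precision`, `chance_baselines`, `simulate_random_scores`) and the simulated class sizes are hypothetical and chosen only for illustration.

```python
import random


def auroc(y_true, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pairs = sorted(zip(scores, y_true))
    n = len(pairs)
    n_pos = sum(y_true)
    n_neg = n - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Group tied scores and give each member the group's average rank.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (ties not treated specially)."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    hits = 0
    total = 0.0
    for rank, k in enumerate(order, start=1):
        if y_true[k]:
            hits += 1
            total += hits / rank  # precision at each positive
    return total / hits


def chance_baselines(y_true):
    """Assumed baselines: 0.5 for AUROC, prevalence for AUPRC,
    majority-class rate for accuracy (an interpretation, not from the paper)."""
    prevalence = sum(y_true) / len(y_true)
    return {
        "auroc": 0.5,
        "auprc": prevalence,
        "accuracy": max(prevalence, 1 - prevalence),
    }


def simulate_random_scores(n_pos=2000, n_neg=8000, seed=0):
    """Score a synthetic cohort with uniform random numbers (hypothetical sizes)."""
    rng = random.Random(seed)
    y_true = [1] * n_pos + [0] * n_neg
    scores = [rng.random() for _ in y_true]
    return auroc(y_true, scores), average_precision(y_true, scores)


if __name__ == "__main__":
    roc, ap = simulate_random_scores()
    print("random-score AUROC:", round(roc, 3))
    print("random-score AUPRC:", round(ap, 3))
    print("analytic baselines:", chance_baselines([1] * 2000 + [0] * 8000))
```

Because the AUPRC baseline depends on the dropout rate while the AUROC baseline does not, the same model can look very different on the two metrics when dropout is rare, which is one reason to report each metric against its own baseline.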