Scalable Early Childhood Reading Performance Prediction
Zhongkai Shangguan, Zanming Huang, Eshed Ohn-Bar, Ola Ozernov-Palchik, Derek Kosty, Michael Stoolmiller, Hank Fien
TL;DR
The paper addresses the problem of predicting early reading gains in children by introducing the Enhanced Core Reading Instruction (ECRI) dataset, a large-scale, longitudinal tabular benchmark collected from 44 schools with 6,916 students and 172 teachers. It formulates the task as binary classification over $d=16$ input features, predicting the probability $y\in[0,1]$ of sufficient progress in reading for two target skills: word identification and word attack. To handle heavy missingness without imputation, the authors propose MaskMLP, a self-supervised pre-training approach that masks observed features and aligns the embeddings of the original and masked inputs via a cosine embedding loss, followed by supervised fine-tuning. Empirically, MaskMLP outperforms baselines in both school-split and student-split generalization settings, with notable gains for students receiving additional intervention, and statistical evidence supports its advantage over competing self-supervised methods. The work emphasizes responsible deployment, bias considerations, and data transparency, and releases data and code to support further research on proactive, personalized educational interventions that handle missing data and diverse student profiles robustly.
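The masking-and-alignment idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 30% mask rate, the toy linear encoder, and the synthetic missingness pattern are all assumptions made for illustration, and the sketch only computes the pre-training loss for one sample instead of training an MLP. The loss for a matched pair is taken as one minus the cosine similarity of the two embeddings.

```python
import math
import random


def mask_observed(x, observed, mask_rate, rng):
    """Hide a random subset of the *observed* features (hypothetical helper).

    Features that were already missing stay missing; nothing is imputed.
    Returns the masked values and the updated observation flags.
    """
    new_x, new_obs = [], []
    for value, is_obs in zip(x, observed):
        keep = is_obs and rng.random() >= mask_rate
        new_x.append(value if keep else 0.0)
        new_obs.append(keep)
    return new_x, new_obs


def encode(x, observed, weights):
    """Toy linear encoder (assumption): unobserved features contribute nothing."""
    return [
        sum(w * v for w, v, o in zip(row, x, observed) if o)
        for row in weights
    ]


def cosine_embedding_loss(a, b):
    """1 - cos(a, b): the cosine embedding loss for a pair that should match."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / max(norm, 1e-8)


rng = random.Random(0)
d, hidden = 16, 4  # d = 16 input features as in the paper; hidden size is arbitrary
x = [rng.gauss(0, 1) for _ in range(d)]
observed = [i % 3 != 0 for i in range(d)]  # pretend every third feature is missing
weights = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(hidden)]

x_masked, obs_masked = mask_observed(x, observed, mask_rate=0.3, rng=rng)
z_orig = encode(x, observed, weights)
z_mask = encode(x_masked, obs_masked, weights)
loss = cosine_embedding_loss(z_orig, z_mask)
print(f"pre-training loss for one sample: {loss:.4f}")
```

In a full pipeline this loss would be minimized over the encoder parameters by gradient descent before a classification head is fine-tuned on the labeled reading outcomes.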
Abstract
Models for student reading performance can empower educators and institutions to proactively identify at-risk students, thereby enabling early and tailored instructional interventions. However, there are no suitable publicly available educational datasets for modeling and predicting future reading performance. In this work, we introduce the Enhanced Core Reading Instruction (ECRI) dataset, a novel large-scale longitudinal tabular dataset collected across 44 schools with 6,916 students and 172 teachers. We leverage the dataset to empirically evaluate the ability of state-of-the-art machine learning models to recognize early childhood educational patterns in multivariate and partial measurements. Specifically, we demonstrate that a simple self-supervised strategy, in which a Multi-Layer Perceptron (MLP) network is pre-trained over masked inputs, outperforms several strong baselines while generalizing over diverse educational settings. To facilitate future developments in precise modeling and responsible use of models for individualized and early intervention strategies, our data and code are available at https://ecri-data.github.io/.
