Scalable Early Childhood Reading Performance Prediction

Zhongkai Shangguan, Zanming Huang, Eshed Ohn-Bar, Ola Ozernov-Palchik, Derek Kosty, Michael Stoolmiller, Hank Fien

TL;DR

The paper addresses the prediction of early reading gains in children by introducing the Enhanced Core Reading Instruction (ECRI) dataset, a large-scale longitudinal tabular benchmark collected from 44 schools with 6,916 students and 172 teachers. It formulates the task as binary classification over $d=16$ input features: the model outputs a probability $y\in[0,1]$ that a student makes sufficient progress in reading, for two target skills, word identification and word attack. To handle heavy missingness without imputation, the authors propose MaskMLP, a self-supervised pre-training approach that masks observed features and aligns the embeddings of the original and masked inputs via a cosine embedding loss, followed by supervised fine-tuning. Empirically, MaskMLP outperforms baselines under both school-split and student-split generalization settings, with notable gains for students receiving additional intervention, and statistical tests support its advantage over competing self-supervised methods. The work emphasizes responsible deployment, bias considerations, and data transparency, and releases open data and code to support further research on proactive, personalized educational interventions that are robust to missing data and diverse student profiles.
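The pre-training objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the one-layer `tanh` encoder, the zero-fill plus observed-flag encoding of missing values, the 25% mask rate, and all function names are assumptions chosen for illustration. Only the idea of masking observed features and comparing the two embeddings with a cosine embedding loss comes from the summary.

```python
import math
import random

def encode(x, observed, weights):
    """Toy one-layer encoder (assumption): missing entries are zero-filled
    and a 0/1 observed flag per feature is appended to the input."""
    inputs = [v if o else 0.0 for v, o in zip(x, observed)]
    inputs += [1.0 if o else 0.0 for o in observed]
    return [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in weights]

def cosine_embedding_loss(a, b, eps=1e-8):
    """1 - cosine similarity: 0 when the embeddings point the same way."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / max(norm_a * norm_b, eps)

def random_mask(observed, mask_rate, rng):
    """Hide a random subset of the observed features; features that were
    already missing stay missing."""
    return [o and rng.random() >= mask_rate for o in observed]

rng = random.Random(0)
d, hidden = 16, 8  # d = 16 matches the summary; hidden size is arbitrary
weights = [[rng.gauss(0, 0.5) for _ in range(2 * d)] for _ in range(hidden)]

x = [rng.gauss(0, 1) for _ in range(d)]            # one synthetic student record
observed = [rng.random() > 0.3 for _ in range(d)]  # synthetic missingness pattern
masked = random_mask(observed, 0.25, rng)

# Both views pass through the same encoder; the loss pulls them together.
loss = cosine_embedding_loss(encode(x, observed, weights),
                             encode(x, masked, weights))
print(f"pre-training loss: {loss:.4f}")
```

In an actual training loop this loss would be minimized over the encoder weights by gradient descent before the supervised fine-tuning stage; the sketch only evaluates it once.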

Abstract

Models for student reading performance can empower educators and institutions to proactively identify at-risk students, thereby enabling early and tailored instructional interventions. However, there are no suitable publicly available educational datasets for modeling and predicting future reading performance. In this work, we introduce the Enhanced Core Reading Instruction (ECRI) dataset, a novel large-scale longitudinal tabular dataset collected across 44 schools with 6,916 students and 172 teachers. We leverage the dataset to empirically evaluate the ability of state-of-the-art machine learning models to recognize early childhood educational patterns in multivariate and partial measurements. Specifically, we demonstrate that a simple self-supervised strategy, in which a Multi-Layer Perceptron (MLP) network is pre-trained over masked inputs, outperforms several strong baselines while generalizing over diverse educational settings. To facilitate future developments in precise modeling and responsible use of models for individualized and early intervention strategies, our data and code are available at https://ecri-data.github.io/.

Paper Structure

This paper contains 15 sections, 1 equation, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Self-Supervised MLP Pre-Training. We randomly mask parts of the input variables, i.e., treat them as missing values, and train the model using a loss computed from the original and the masked input, both passed through a common feature extractor (we employ a cosine embedding loss to enforce similarity between the two embeddings).
  • Figure 2: Breakdown Results Across Quantile Groups on ECRI. Model accuracy broken down over five classes, defined by quintiles of student improvement relative to the first assessment: large regression, slight regression, no change, slight improvement, and large improvement.
  • Figure 3: Visualization of t-SNE-based Embedding and Student Profile Analysis. The visualization uses embeddings derived from the MLP (left) and MaskMLP (right) models for the word identification task, with negative samples shaded in gray and positive samples shaded in green. The pre-training step in MaskMLP results in an embedding with greater separation among student profiles.
  • Figure 4: Feature Importance Analysis. We show feature importance by plotting the decrease in model accuracy on word identification classification after dropping each input variable, one at a time. All literacy assessment measures used for input are obtained at the start of the school year.
  • Figure 5: Race and Ethnicity Distribution across Different Socio-Economic Groups. Left: Distribution for the five schools with the highest percentage of students receiving free or reduced-price lunch. Right: Distribution for the five schools with the lowest percentage of students receiving free or reduced-price lunch.
  • ...and 1 more figure
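The drop-one-variable analysis described in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: "dropping" a variable is taken to mean marking it as missing (`None`) at evaluation time without retraining, and the classifier, data, and feature names (`feature_a`, `feature_b`) are hypothetical.

```python
def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(predict(r) == y for r, y in zip(rows, labels))
    return correct / len(rows)

def drop_one_importance(predict, rows, labels, names):
    """Accuracy decrease after hiding each input variable, one at a time."""
    base = accuracy(predict, rows, labels)
    scores = {}
    for j, name in enumerate(names):
        # Assumption: a dropped variable is presented to the model as missing.
        ablated = [r[:j] + [None] + r[j + 1:] for r in rows]
        scores[name] = base - accuracy(predict, ablated, labels)
    return scores

def toy_predict(row):
    """Hypothetical classifier: positive when the first feature is observed
    and exceeds 0.5; the second feature is ignored."""
    return int(row[0] is not None and row[0] > 0.5)

rows = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.6]]
labels = [1, 0, 1, 0]
names = ["feature_a", "feature_b"]

scores = drop_one_importance(toy_predict, rows, labels, names)
for name, drop in scores.items():
    print(f"{name}: accuracy drop = {drop:.2f}")
```

A larger drop indicates that the model relies more heavily on that variable; a variable the model ignores, like `feature_b` here, shows no drop.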