Table of Contents
Fetching ...

Machine Learning Predicts Upper Secondary Education Dropout as Early as the End of Primary School

Maria Psyridou, Fabi Prezja, Minna Torppa, Marja-Kristiina Lerkkanen, Anna-Maija Poikkeus, Kati Vasalampi

TL;DR

The machine learning models developed in this study demonstrated notable classification ability, and may have the potential to proactively support educators’ processes and existing protocols for identifying at-risk students, thereby potentially aiding in the reinvention of student retention and success strategies and ultimately contributing to improved educational outcomes.

Abstract

Education plays a pivotal role in alleviating poverty, driving economic growth, and empowering individuals, thereby significantly influencing societal and personal development. However, the persistent issue of school dropout poses a significant challenge, with its effects extending beyond the individual. While previous research has employed machine learning for dropout classification, these studies often suffer from a short-term focus, relying on data collected only a few years into the study period. This study expanded the modeling horizon by utilizing a 13-year longitudinal dataset, encompassing data from kindergarten to Grade 9. Our methodology incorporated a comprehensive range of parameters, including students' academic and cognitive skills, motivation, behavior, well-being, and officially recorded dropout data. The machine learning models developed in this study demonstrated notable classification ability, achieving a mean area under the curve (AUC) of 0.61 with data up to Grade 6 and an improved AUC of 0.65 with data up to Grade 9. Further data collection and independent correlational and causal analyses are crucial. In future iterations, such models may have the potential to proactively support educators' processes and existing protocols for identifying at-risk students, thereby potentially aiding in the reinvention of student retention and success strategies and ultimately contributing to improved educational outcomes.

Machine Learning Predicts Upper Secondary Education Dropout as Early as the End of Primary School

TL;DR

The machine learning models developed in this study demonstrated notable classification ability, and may have the potential to proactively support educators’ processes and existing protocols for identifying at-risk students, thereby potentially aiding in the reinvention of student retention and success strategies and ultimately contributing to improved educational outcomes.

Abstract

Education plays a pivotal role in alleviating poverty, driving economic growth, and empowering individuals, thereby significantly influencing societal and personal development. However, the persistent issue of school dropout poses a significant challenge, with its effects extending beyond the individual. While previous research has employed machine learning for dropout classification, these studies often suffer from a short-term focus, relying on data collected only a few years into the study period. This study expanded the modeling horizon by utilizing a 13-year longitudinal dataset, encompassing data from kindergarten to Grade 9. Our methodology incorporated a comprehensive range of parameters, including students' academic and cognitive skills, motivation, behavior, well-being, and officially recorded dropout data. The machine learning models developed in this study demonstrated notable classification ability, achieving a mean area under the curve (AUC) of 0.61 with data up to Grade 6 and an improved AUC of 0.65 with data up to Grade 9. Further data collection and independent correlational and causal analyses are crucial. In future iterations, such models may have the potential to proactively support educators' processes and existing protocols for identifying at-risk students, thereby potentially aiding in the reinvention of student retention and success strategies and ultimately contributing to improved educational outcomes.
Paper Structure (17 sections, 11 equations, 6 figures, 2 tables)

This paper contains 17 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Confusion matrices for classifiers using data up to Grade 9 (first row) and up to Grade 6 (second row) averaged across all folds in six-fold cross-validation.
  • Figure 2: The ROC Curves for the B-RandomForest classifiers from cross-validation. (a) Curve for the B-RandomForest classifier trained using data up to Grade 9. (b) Curve for another classifier instance trained using data up to Grade 6.
  • Figure 3: The top ranked 20 features for the B-RandomForest using data up to Grade 9. Features are listed in order of average score from top to bottom. The scores are averages from across all folds of the six-fold cross-validation. The features listed pertain to: READ2=Reading fluency, Grade 2; READ4=Reading fluency, Grade 4; READ3=Reading fluency, Grade 3; READ1=Reading fluency, Grade 1; RAN=Rapid Automatized Naming, Kindergarten; multSC7=Multiplication, Grade 4; ariSC4=Arithmetic, Grade 1 spring; ly1C5C=Reading comprehension, Grade 2; ariSC6=Arithmetic, Grade 3; ly4C7C=Reading comprehension, Grade 4; ly1C4C=Reading comprehension, Grade 1; ariSC7=Arithmetic, Grade 4; ariSC5=Arithmetic, Grade 2; ppvSC2=Vocabulary, Kindergarten; pisaC10total_sum=PISA, Grade 9; ariSC3=Arithmetic, Grade 1 fall; multSC9=Multiplication, Grade 7; READ9=Reading fluency, Grade 9; READ7=Reading fluency, Grade 7; ly1C6C=Reading comprehension, Grade 3
  • Figure 4: The top ranked 20 features for the B-RandomForest using data up to Grade 6. Features are listed in order of average score from top to bottom. The scores are averages from across all folds of the six-fold cross-validation. READ1=Reading fluency, Grade 1; READ2=Reading fluency, Grade 2; READ4=Reading fluency, Grade 4; multSC7=Multiplication, Grade 4; READ3=Reading fluency, Grade 3; RAN=Rapid Automatized Naming, Kindergarten; ariSC4=Arithmetic, Grade 1 spring; multSC8=Multiplication, Grade 6; READ6=Reading fluency, Grade 6; ariSC6=Arithmetic, Grade 3; ly1C5C=Reading comprehension, Grade 2; ariSC7=Arithmetic, Grade 4; ariSC5=Arithmetic, Grade 2; voedo=Parental education; ly4C7C=Reading comprehension, Grade 4; ly1C4C=Reading comprehension, Grade 1; ppvSC2=Vocabulary, Kindergarten; tavma_g6=Task value for math, Grade 6; ariSC3=Arithmetic, Grade 1 fall; ly6C8C=Reading comprehension, Grade 6
  • Figure 5: Proposed research workflow. Our process begins with data collection over 13 years, from kindergarten to the end of upper secondary school (Step 1), followed by data processing which includes cleaning and imputing missing feature values (Step 2). We then apply four machine learning models for dropout and non-dropout classification (Step 3), and evaluate these models using 6-fold cross-validation, focusing on performance metrics and ROC curves (Step 4).
  • ...and 1 more figures