Accurate Multi-Category Student Performance Forecasting at Early Stages of Online Education Using Neural Networks
Naveed Ur Rehman Junejo, Muhammad Wasim Nawaz, Qingsheng Huang, Xiaoqing Dong, Chang Wang, Gengzhong Zheng
TL;DR
The paper tackles early multiclass prediction of online student performance, classifying students into four outcome categories (Distinction, Pass, Fail, Withdrawn) using a data-analytic pipeline on OULAD. It introduces a 1D-CNN architecture with a preprocessing workflow that merges multiple data sources, engineers features such as total_reg_days, and trains with class weights to handle imbalance. Empirically, the approach significantly outperforms baselines such as RF, DFFNN, and ANN-LSTM on accuracy, precision, recall, and F1, reaching around 92% accuracy at 20% of course length and about 98% at full course length, with robust ROC-AUC even in early windows. The work shows strong potential for timely interventions in online education and suggests Transformer-based extensions as a direction for future research.
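The summary above mentions training with class weights to handle imbalance across the four outcome categories. The exact weighting scheme is not stated here, so the following is only a minimal sketch that assumes the common "balanced" heuristic, in which each class weight is inversely proportional to its frequency. The function name `balanced_class_weights` and the toy label counts are illustrative and are not taken from the paper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}.

    Assumption: the 'balanced' heuristic; the paper may use another scheme.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy label distribution, made up for illustration (not OULAD statistics).
toy_labels = (
    ["Pass"] * 50 + ["Withdrawn"] * 30 + ["Fail"] * 15 + ["Distinction"] * 5
)
weights = balanced_class_weights(toy_labels)

for cls, w in sorted(weights.items()):
    print(f"{cls}: {w:.3f}")
```

Rare classes such as Distinction receive larger weights, so their misclassifications contribute more to the training loss. A dictionary of this shape, keyed by integer class index rather than by label, is what many training APIs accept as a class-weight argument.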
Abstract
The ability to accurately predict and analyze student performance in online education, both at the outset and throughout the semester, is vital. Most published studies focus on binary classification (Fail or Pass), leaving a significant research gap in predicting student performance across multiple categories. This study introduces a novel neural network-based approach capable of accurately predicting student performance and identifying vulnerable students at early stages of online courses. The Open University Learning Analytics (OULA) dataset is used to develop and test the proposed model, which predicts four outcome categories: Distinction, Fail, Pass, and Withdrawn. The OULA dataset is preprocessed to extract features from demographic data, assessment data, and clickstream interactions within a Virtual Learning Environment (VLE). Comparative simulations indicate that the proposed model significantly outperforms existing baseline models, including Artificial Neural Network Long Short-Term Memory (ANN-LSTM), Random Forest (RF) with the 'gini' and 'entropy' criteria, and Deep Feed Forward Neural Network (DFFNN), in terms of accuracy, precision, recall, and F1-score. The results indicate that the prediction accuracy of the proposed method is about 25% higher than that of the existing state of the art. Furthermore, the model retains its advantage over existing methods as the course progresses, achieving higher accuracy even in the first 20% of the course.
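To illustrate what predicting at the first 20% of a course can mean for the clickstream features, here is a minimal sketch that keeps only interactions logged within an early window and totals them per student. It assumes rows shaped like the dataset's VLE interaction records, with `id_student`, `date` (day relative to course start), and `sum_click` fields. The helper name `early_window_clicks`, the toy rows, and the choice to include pre-start activity (negative days) are assumptions for illustration, not the authors' preprocessing.

```python
from collections import defaultdict


def early_window_clicks(click_rows, course_length_days, fraction=0.2):
    """Sum clicks per student over days up to `fraction` of the course length."""
    cutoff_day = fraction * course_length_days
    totals = defaultdict(int)
    for row in click_rows:
        # Days are relative to course start; negative values are pre-start
        # activity, which this sketch keeps (an assumption).
        if row["date"] <= cutoff_day:
            totals[row["id_student"]] += row["sum_click"]
    return dict(totals)


# Toy interaction log, made up for illustration.
toy_rows = [
    {"id_student": 1, "date": -5, "sum_click": 3},
    {"id_student": 1, "date": 10, "sum_click": 7},
    {"id_student": 1, "date": 120, "sum_click": 4},
    {"id_student": 2, "date": 40, "sum_click": 2},
    {"id_student": 2, "date": 60, "sum_click": 9},
]
early = early_window_clicks(toy_rows, course_length_days=250)
print(early)
```

With a 250-day course, the cutoff falls at day 50, so later interactions are excluded from the early-window features. Repeating the aggregation with larger fractions would give the feature sets for later prediction points.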
