Leveraging Language Models for Analyzing Longitudinal Experiential Data in Education
Ahatsham Hayat, Bilal Khan, Mohammad Rashedul Hasan
TL;DR
The paper addresses forecasting STEM students' academic trajectories from high-dimensional longitudinal experiential data, a task hindered by missing values, small sample sizes, and complex temporal dynamics. It introduces a data-enrichment pipeline that converts cognitive scores to text and fuses static background data with time-series non-cognitive and cognitive information to fine-tune pre-trained transformer-based LLMs (decoder-only and encoder-decoder) on 2-, 3-, and 4-week inputs. Key contributions include a 78-dimensional multi-modal dataset, a tailored data-enrichment methodology (including missing-value descriptors and data augmentation), and a thorough empirical comparison showing that LLMs can integrate modalities and handle missing data but rely largely on surface statistical patterns rather than deep temporal understanding. The findings show high accuracy from early (2- to 4-week) inputs, particularly with multi-modal data, while also revealing limitations in the use of explicit temporal information. They underscore the need for larger datasets and advances in temporal modeling to support just-in-time educational interventions.
Abstract
We propose a novel approach to leveraging pre-trained language models (LMs) for early forecasting of STEM students' academic trajectories from high-dimensional longitudinal experiential data. These data, which capture students' study-related activities, behaviors, and psychological states, offer valuable insights for forecasting-based interventions. Key challenges in handling such data include high rates of missing values, limited dataset size due to costly data collection, and complex temporal variability across modalities. Our approach addresses these issues through a comprehensive data enrichment process that integrates strategies for managing missing values, augmenting data, and embedding task-specific instructions and contextual cues to enhance the models' capacity for learning temporal patterns. Through extensive experiments on a curated student learning dataset, we evaluate both encoder-decoder and decoder-only LMs. Our findings show that although LMs effectively integrate data across modalities and are resilient to missing data, they rely primarily on high-level statistical patterns rather than a deeper understanding of temporal dynamics. Furthermore, their ability to interpret explicit temporal information remains limited. This work advances educational data science by highlighting both the potential and the limitations of LMs in modeling student trajectories for early intervention based on longitudinal experiential data.
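
To make the data enrichment idea concrete, the following is a minimal Python sketch of how static background data and weekly records might be serialized into a single text input, with an explicit descriptor for missing values and a coarse textual label for cognitive scores. It is an illustration under stated assumptions, not the authors' actual pipeline: the function names, the field names (`major`, `quiz_score`, `stress`), the 0-100 score scale, the score thresholds, the prompt wording, and the "not reported" descriptor are all hypothetical.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical missing-value descriptor; the paper's actual wording is not given here.
MISSING = "not reported"


def score_to_text(score: Optional[float]) -> str:
    """Map a numeric cognitive score (assumed 0-100 scale) to a coarse text label."""
    if score is None:
        return MISSING
    if score >= 80:  # thresholds are illustrative assumptions
        return "high"
    if score >= 60:
        return "medium"
    return "low"


def serialize_student(
    static: Mapping[str, str],
    weeks: Sequence[Mapping[str, Optional[float]]],
) -> str:
    """Fuse static background data and weekly time-series records into one text input."""
    # Task-specific instruction placed at the start of the input (assumed wording).
    lines = ["Task: forecast the student's academic trajectory."]
    lines.append("Background: " + "; ".join(f"{k} = {v}" for k, v in static.items()))
    for i, week in enumerate(weeks, start=1):
        parts = []
        for name, value in week.items():
            if name == "quiz_score":  # hypothetical cognitive field, converted to text
                parts.append(f"{name} is {score_to_text(value)}")
            elif value is None:       # non-cognitive field with a missing value
                parts.append(f"{name} is {MISSING}")
            else:
                parts.append(f"{name} is {value:g}")
        # An explicit week index serves as a simple temporal cue.
        lines.append(f"Week {i}: " + ", ".join(parts))
    return "\n".join(lines)


if __name__ == "__main__":
    example = serialize_student(
        {"major": "CS"},
        [{"quiz_score": 85.0, "stress": None}, {"quiz_score": None, "stress": 3.0}],
    )
    print(example)
```

Passing only the first 2, 3, or 4 weekly records to `serialize_student` would correspond to the 2-, 3-, and 4-week input settings mentioned above; the resulting string could then be paired with a target label for fine-tuning.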
