Data Augmentation for Sparse Multidimensional Learning Performance Data Using Generative AI

Liang Zhang; Jionghao Lin; John Sabatini; Conrad Borchers; Daniel Weitekamp; Meng Cao; John Hollander; Xiangen Hu; Arthur C. Graesser

TL;DR

This work tackles extreme sparsity in the learning performance data collected by intelligent tutoring systems (ITSs). It proposes a systematic augmentation framework that combines tensor factorization-based imputation with generative AI (GenAI)-based augmentation. Learning events are modeled as a 3D tensor over learners, questions, and attempts; tensor factorization imputes the missing values, and GANs and GPT-4o then generate augmented samples that capture individualized learning patterns. Empirical results on AutoTutor ARC show that tensor factorization imputes more accurately than BKT, PFA, and SPARFA-Lite, that GAN-based augmentation is more stable, and that GPT-4o can achieve higher fidelity in certain regimes. The framework enables scalable, diverse data generation to improve knowledge tracing, reduce biases, and support more robust, personalized ITS capabilities.

Abstract

Learning performance data describe correct and incorrect answers or problem-solving attempts in adaptive learning, such as in intelligent tutoring systems (ITSs). Learning performance data tend to be highly sparse (80% to 90% missing observations) in most real-world applications due to adaptive item selection. This data sparsity presents challenges to using learner models to effectively predict future performance and explore new hypotheses about learning. This article proposes a systematic framework for augmenting learner data to address data sparsity in learning performance data. First, learning performance is represented as a three-dimensional tensor of learners, questions, and attempts, capturing longitudinal knowledge states during learning. Second, a tensor factorization method is used to impute missing values in sparse tensors of collected learner data, thereby grounding the imputation on knowledge tracing tasks that predict missing performance values based on real observations. Third, a module for generating patterns of learning is used. This study contrasts two forms of generative Artificial Intelligence (AI), Generative Adversarial Networks (GANs) and Generative Pre-trained Transformers (GPT), to generate data associated with different clusters of learner data. We tested this approach on an adult literacy dataset from AutoTutor lessons developed for Adult Reading Comprehension (ARC). We found that: (1) tensor factorization improved the performance in tracing and predicting knowledge mastery compared with other knowledge tracing techniques without data augmentation, showing higher relative fidelity for this imputation method, and (2) compared with GPT, the GAN-based simulation showed greater overall stability and less statistical bias in a divergence evaluation with varying simulation sample sizes.
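The first two steps of the framework (tensor construction and factorization-based imputation) can be illustrated with a minimal sketch. It is not the authors' implementation: a generic rank-R CP factorization with a sigmoid link, fitted by gradient descent on observed entries only, is swapped in for illustration. The function names `build_tensor`, `sparsity`, and `cp_impute`, the hyperparameters, and the toy records are all assumptions.

```python
import numpy as np

def build_tensor(records, n_learners, n_questions, n_attempts):
    """Place observed scores in a learners x questions x attempts tensor.

    Unobserved cells stay NaN so that sparsity can be measured directly.
    """
    tensor = np.full((n_learners, n_questions, n_attempts), np.nan)
    for learner, question, attempt, score in records:
        tensor[learner, question, attempt] = score
    return tensor

def sparsity(tensor):
    """Fraction of cells with no observation."""
    return float(np.isnan(tensor).mean())

def cp_impute(tensor, rank=2, steps=500, lr=0.05, reg=0.01, seed=0):
    """Fill missing cells with a rank-`rank` CP model (illustrative only).

    The model is sigmoid(sum_r A[i,r] * B[j,r] * C[k,r]); only observed
    cells contribute to the cross-entropy loss being minimized.
    """
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(tensor)
    observed = np.where(mask, tensor, 0.0)
    n_i, n_j, n_k = tensor.shape
    A = rng.normal(scale=0.5, size=(n_i, rank))
    B = rng.normal(scale=0.5, size=(n_j, rank))
    C = rng.normal(scale=0.5, size=(n_k, rank))

    def predict():
        logits = np.einsum("ir,jr,kr->ijk", A, B, C)
        return 1.0 / (1.0 + np.exp(-logits))

    for _ in range(steps):
        # Gradient of the cross-entropy loss w.r.t. the logits, masked
        # so that missing cells contribute nothing.
        err = mask * (predict() - observed)
        grad_A = np.einsum("ijk,jr,kr->ir", err, B, C) + reg * A
        grad_B = np.einsum("ijk,ir,kr->jr", err, A, C) + reg * B
        grad_C = np.einsum("ijk,ir,jr->kr", err, A, B) + reg * C
        A -= lr * grad_A
        B -= lr * grad_B
        C -= lr * grad_C

    # Keep real observations; use model predictions only where data is missing.
    return np.where(mask, tensor, predict())

# Toy records (learner, question, attempt, score); values are made up.
toy_records = [(0, 0, 0, 1.0), (0, 1, 0, 0.0), (1, 0, 1, 1.0), (2, 1, 2, 0.0)]
toy_tensor = build_tensor(toy_records, n_learners=3, n_questions=2, n_attempts=3)
toy_filled = cp_impute(toy_tensor)
```

Keeping observed cells unchanged and filling only the gaps mirrors the abstract's point that imputation is grounded in real observations; the rank and regularization would need tuning on real data.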

Paper Structure (21 sections, 4 equations, 11 figures, 5 tables)

Introduction
Related Work
Intelligent Tutoring Systems for Adult Reading Comprehension
Tensor-based Imputation for Sparse Performance Data
Generative AI for Augmenting Sparse Educational Data
Dataset
Methods
The Systematic Augmentation Framework
Construction of 3D Tensor for Learning Performance Data
Tensor-based Imputation
Identification of Learning Performance Patterns by Clustering
Data Augmentation based on Generative Models
Experimental Setup and Evaluation
Results
Sparsity Measurement and Latent Features Obtained by Tensor Factorization
...and 6 more sections

Figures (11)

Figure 1: Data sparsity issues in learning performance data in Intelligent Tutoring Systems.
Figure 2: The systematic augmentation framework for learning performance in Intelligent Tutoring Systems.
Figure 3: Estimates of parameters $a$ and $b$ and identification of learning performance patterns through clustering.
Figure 4: EMD measurement for scalable sampling by data augmentation. The original sample size is 20, and the augmentations are shown in increments of 1,000, with total sizes ranging from 1,000 to 20,000.
Figure 5: Comparison of EMD measurement for data augmentation between Vanilla GAN and GPT-4o.
...and 6 more figures
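
The EMD (Earth Mover's Distance) measurement named in the captions of Figures 4 and 5 can be sketched for the one-dimensional case as the area between two empirical cumulative distribution functions. This is a generic formulation, not the paper's evaluation code; the function name `emd_1d` and the sample values are assumptions made for illustration.

```python
import numpy as np

def emd_1d(sample_a, sample_b):
    """1D Earth Mover's Distance between two empirical samples.

    Computed as the integral of |CDF_a - CDF_b| over the pooled support,
    so the two samples may have different sizes.
    """
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    pooled = np.sort(np.concatenate([a, b]))
    widths = np.diff(pooled)
    # Empirical CDF of each sample evaluated at the left edge of each interval.
    cdf_a = np.searchsorted(a, pooled[:-1], side="right") / a.size
    cdf_b = np.searchsorted(b, pooled[:-1], side="right") / b.size
    return float(np.sum(np.abs(cdf_a - cdf_b) * widths))

# Hypothetical values: a small original sample and a larger synthetic one.
original = [0.2, 0.4, 0.6, 0.8]
synthetic = [0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8]
distance = emd_1d(original, synthetic)
```

A value near zero means the augmented sample follows the original distribution closely; tracking the value as the synthetic sample grows is one way to examine stability across sample sizes.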
