Towards Robust Knowledge Tracing Models via k-Sparse Attention

Shuyan Huang; Zitao Liu; Xiangyu Zhao; Weiqi Luo; Jian Weng

TL;DR

This work tackles overfitting in attention-based knowledge tracing by introducing sparseKT, a k-sparse attention framework that retains only the top-$k$ historical interactions after a self-attention pass. It includes two sparsification strategies, soft-thresholding and top-$K$, and augments interaction embeddings with a question-specific discrimination factor to improve robustness without sacrificing accuracy. Empirical results on three public educational datasets show sparseKT achieves competitive AUC/accuracy and often ranks in the top tier, improving generalization over SAKT and other baselines, with transparent KC relation visualizations supporting interpretability. The approach is open-source, enabling reproducibility and practical adoption in educational settings.

Abstract

Knowledge tracing (KT) is the problem of predicting students' future performance based on their historical interaction sequences. Because it captures long-term contextual dependencies well, the attention mechanism has become an essential component of many deep learning based KT (DLKT) models. Despite the impressive performance of these attentional DLKT models, many of them are prone to overfitting, especially on small-scale educational datasets. Therefore, in this paper, we propose \textsc{sparseKT}, a simple yet effective framework to improve the robustness and generalization of attention-based DLKT approaches. Specifically, we incorporate a $k$-selection module that picks only the items with the highest attention scores. We propose two sparsification heuristics: (1) soft-thresholding sparse attention and (2) top-$K$ sparse attention. We show that \textsc{sparseKT} helps attentional KT models discard irrelevant student interactions and achieves predictive performance comparable to 11 state-of-the-art KT models on three publicly available real-world educational datasets. To encourage reproducible research, we make our data and code publicly available at \url{https://github.com/pykt-team/pykt-toolkit}\footnote{We merged our model into the \textsc{pyKT} benchmark at \url{https://pykt.org/}.}.
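To make the $k$-selection idea concrete, the following is a minimal pure-Python sketch of the two sparsification heuristics applied to one row of attention weights. It is not the released pyKT implementation. The function names, the example scores, and the renormalization of the surviving weights are illustrative assumptions. The soft-thresholding variant in particular is one plausible reading (keep the largest weights until their cumulative mass reaches a threshold `tau`); the abstract does not specify its exact form.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(weights, keep):
    """Zero every weight whose index is not in `keep`, then rescale to sum to 1.

    Renormalizing after selection is an assumption of this sketch.
    """
    kept = [w if i in keep else 0.0 for i, w in enumerate(weights)]
    total = sum(kept)
    return [w / total for w in kept]


def top_k_sparse(weights, k):
    """Keep the k largest attention weights (k >= 1); ties go to the earlier index."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    return _renormalize(weights, set(order[:k]))


def cumulative_threshold_sparse(weights, tau):
    """ASSUMED form of soft-thresholding: keep the largest weights until
    their cumulative mass reaches tau (0 < tau <= 1)."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += weights[i]
        if mass >= tau:
            break
    return _renormalize(weights, keep)


# Hypothetical raw attention scores of one query over five past interactions.
scores = [2.0, 0.1, 1.5, -1.0, 0.3]
weights = softmax(scores)

top2 = top_k_sparse(weights, k=2)
soft = cumulative_threshold_sparse(weights, tau=0.85)

print("dense :", [round(w, 3) for w in weights])
print("top-2 :", [round(w, 3) for w in top2])
print("soft  :", [round(w, 3) for w in soft])
```

In this sketch the top-$K$ variant always keeps a fixed number of interactions, while the cumulative-threshold variant keeps a variable number depending on how concentrated the attention row is. A full model would also apply a causal mask before the softmax, which is omitted here for brevity.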

Paper Structure (16 sections, 5 equations, 4 figures, 1 table)

Introduction
Preliminary
Self Attentive Knowledge Tracing
Related Work
Attention based Knowledge Tracing
Sparse Attention
The sparseKT Approach
Embedding
k-Sparse Attention
Prediction Layer
Experiments
Results
Overall Performance
Impact of the Sparsity Level
Visualization of KC Relations via $k$-Sparse Attention
...and 1 more sections

Figures (4)

Figure 1: An illustration of the KT problem. A KC is a generalization of everyday terms like concept, principle, or skill.
Figure 2: The sparseKT illustration.
Figure 3: AUC performance of different values of $k$ with our sparseKT-soft and sparseKT-topK on AS2015.
Figure 4: Attention weights visualization of sparseKT-topK. The y-axis is the pre-interaction KCs, and the x-axis is the post-interaction KCs.
