Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing
Yilmazcan Ozyurt, Stefan Feuerriegel, Mrinmaya Sachan
TL;DR
This work addresses two key limitations of knowledge tracing (KT): the need for manual, error-prone knowledge concept (KC) annotations, and the neglect of the semantic content of questions and KCs. It introduces KCQRL, a framework with three modules: (i) automated KC annotation, which uses chain-of-thought prompting of LLMs to generate solution steps and map them to KCs; (ii) contrastive representation learning, which aligns question content and solution steps with their KCs while mitigating false negatives through KC clustering; and (iii) integration of the learned question embeddings into existing KT models, replacing their randomly initialized embeddings and aggregating step-level information. The approach yields consistent improvements across 15 KT models on two large math datasets, with notable gains in low-data settings and for weaker baselines, and is shown to scale without adding runtime overhead to the KT models. The results demonstrate the value of incorporating semantic context into KT and suggest broad applicability across domains and KT architectures, potentially reducing reliance on expert KC annotations in practice.
Abstract
Knowledge tracing (KT) is a popular approach for modeling students' learning progress over time, which can enable more personalized and adaptive learning. However, existing KT approaches face two major limitations: (1) they rely heavily on expert-defined knowledge concepts (KCs) in questions, and producing these annotations is time-consuming and prone to errors; and (2) KT methods tend to overlook the semantics of both questions and the given KCs. In this work, we address these challenges and present KCQRL, a framework for automated knowledge concept annotation and question representation learning that can improve the effectiveness of any existing KT model. First, we propose an automated KC annotation process using large language models (LLMs), which generates question solutions and then annotates KCs in each solution step of the questions. Second, we introduce a contrastive learning approach to generate semantically rich embeddings for questions and solution steps, aligning them with their associated KCs via a tailored false negative elimination approach. These embeddings can be readily integrated into existing KT models, replacing their randomly initialized embeddings. We demonstrate the effectiveness of KCQRL across 15 KT algorithms on two large real-world math learning datasets, where we achieve consistent performance improvements.
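
To make the idea of contrastive alignment with false negative elimination more concrete, the following is a minimal sketch and not the authors' implementation. It assumes an InfoNCE-style loss over cosine similarities, one positive KC per question, and precomputed KC cluster ids; the function name masked_info_nce, its parameters, and the temperature value are all hypothetical. KCs that share a cluster with the positive KC are treated as likely false negatives and dropped from the denominator.

```python
import numpy as np

def masked_info_nce(q_emb, kc_emb, pos_idx, kc_cluster, temperature=0.1):
    """InfoNCE-style loss aligning questions with KCs (illustrative only).

    q_emb:      (n_q, d) question embeddings
    kc_emb:     (n_kc, d) KC embeddings
    pos_idx:    (n_q,) index of the positive KC for each question (assumption:
                exactly one positive per question)
    kc_cluster: (n_kc,) cluster id of each KC (assumed to be given)
    """
    # Cosine similarity via L2-normalized dot products.
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    k = kc_emb / np.linalg.norm(kc_emb, axis=1, keepdims=True)
    logits = q @ k.T / temperature  # shape (n_q, n_kc)

    # A KC in the same cluster as the positive KC is a likely false negative.
    same_cluster = kc_cluster[pos_idx][:, None] == kc_cluster[None, :]
    is_positive = np.arange(k.shape[0])[None, :] == pos_idx[:, None]
    false_negative = same_cluster & ~is_positive

    # Remove false negatives from the softmax denominator.
    logits = np.where(false_negative, -np.inf, logits)

    # Numerically stable log-sum-exp over the remaining candidates.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    pos_logit = logits[np.arange(len(pos_idx)), pos_idx]
    return float(np.mean(log_denom - pos_logit))

# Toy usage with random embeddings: KCs 0 and 1 share a cluster, KC 2 does not.
rng = np.random.default_rng(0)
questions = rng.normal(size=(4, 8))
kcs = rng.normal(size=(3, 8))
positives = np.array([0, 1, 2, 0])
loss = masked_info_nce(questions, kcs, positives, np.array([0, 0, 1]))
```

In this sketch, masking only shrinks the set of negatives, so the loss is never larger than the unmasked version on the same inputs. The same pattern could be applied to solution-step embeddings, and the resulting question embeddings could then stand in for a KT model's randomly initialized ones.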
