Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing

Yilmazcan Ozyurt, Stefan Feuerriegel, Mrinmaya Sachan

TL;DR

This work addresses two key KT limitations: the need for manual, error-prone KC annotations and the neglect of semantic content in questions and KCs. It introduces KCQRL, a framework with three modules: (i) automated KC annotation via chain-of-thought prompting of LLMs to generate solution steps and map them to KCs, (ii) contrastive learning-based representation learning that aligns question content and solution steps with KCs while mitigating false negatives through KC clustering, and (iii) integration of the learned question embeddings into existing KT models by replacing random embeddings and aggregating step-level information. The approach yields consistent improvements across 15 KT models on two large Math datasets, with notable gains in low-data settings and for weaker baselines, and is shown to scale without adding runtime overhead to the KT models. The results demonstrate the value of incorporating semantic context into KT and suggest broad applicability across domains and KT architectures, potentially reducing reliance on expert KC annotations in practice.

Abstract

Knowledge tracing (KT) is a popular approach for modeling students' learning progress over time, which can enable more personalized and adaptive learning. However, existing KT approaches face two major limitations: (1) they rely heavily on expert-defined knowledge concepts (KCs) in questions, which is time-consuming and prone to errors; and (2) KT methods tend to overlook the semantics of both questions and the given KCs. In this work, we address these challenges and present KCQRL, a framework for automated knowledge concept annotation and question representation learning that can improve the effectiveness of any existing KT model. First, we propose an automated KC annotation process using large language models (LLMs), which generates question solutions and then annotates KCs in each solution step of the questions. Second, we introduce a contrastive learning approach to generate semantically rich embeddings for questions and solution steps, aligning them with their associated KCs via a tailored false negative elimination approach. These embeddings can be readily integrated into existing KT models, replacing their randomly initialized embeddings. We demonstrate the effectiveness of KCQRL across 15 KT algorithms on two large real-world Math learning datasets, where we achieve consistent performance improvements.
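The abstract describes aligning question and solution-step embeddings with their KCs through contrastive learning, with a false negative elimination step. The exact loss is not given in this section, so the following is only a minimal sketch under stated assumptions: an InfoNCE-style objective for a single anchor, with KC cluster assignments assumed to be precomputed. The function name `masked_contrastive_loss`, the `temperature` value, and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def masked_contrastive_loss(anchor, kc_embs, positive_idx, cluster_ids, temperature=0.1):
    """InfoNCE-style loss for one anchor (a question or solution step) and one positive KC.

    Assumption: KCs that share a cluster with the positive KC are likely false
    negatives, so they are dropped from the denominator instead of being pushed away.
    """
    a = normalize(anchor)
    pos_cluster = cluster_ids[positive_idx]
    pos_logit = dot(a, normalize(kc_embs[positive_idx])) / temperature

    # Keep the positive plus every KC from a different cluster as a negative.
    logits = [pos_logit]
    for j, kc in enumerate(kc_embs):
        if j != positive_idx and cluster_ids[j] != pos_cluster:
            logits.append(dot(a, normalize(kc)) / temperature)

    # Numerically stable log-sum-exp over the retained logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - pos_logit


# Toy example: KC 1 is a near-duplicate of the positive KC 0, KC 2 is unrelated.
anchor = [1.0, 0.0]
kc_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
loss_masked = masked_contrastive_loss(anchor, kc_embs, 0, cluster_ids=[0, 0, 1])
loss_unmasked = masked_contrastive_loss(anchor, kc_embs, 0, cluster_ids=[0, 1, 2])
```

In the toy example, placing the near-duplicate KC in the same cluster as the positive removes it from the negatives, so `loss_masked` is smaller than `loss_unmasked`. This mirrors the motivation stated above: semantically equivalent KCs should not be penalized as negatives.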

Paper Structure (32 sections, 8 equations, 15 figures, 5 tables)

Figures (15)

  • Figure 1: Overview of standard KT formulation and its limitations.
  • Figure 2: Overview of our KCQRL framework and how it can be applied on top of existing KT models. Top left: simplified illustration of the standard KT formulation, where the embeddings of questions and/or KC identifiers are initialized randomly for the prediction task. Our KCQRL improves the standard KT formulation via three modules: (1) Bottom left: shows how question IDs are translated into question content, solution steps (simplified for readability), and KCs via KC annotation (Sec. \ref{sec:kc_annot}). (2) Bottom right: shows how these annotations are leveraged for representation learning of questions via a tailored contrastive learning and false negative elimination (Sec. \ref{sec:kc_infer}). (3) Top right: shows how these learned representations initialize the embeddings of a KT model to improve the performance of the latter (Sec. \ref{sec:kt_imprv}).
  • Figure 3: Improvement of our KCQRL across different training set sizes. Each plot shows a different KT model; the x-axis varies the number of students in our datasets, and performance is reported for each size. The green area marks the improvement from our framework.
  • Figure 4: Improvement of our KCQRL in multi-step-ahead prediction. Plots show different KT models, where we vary the portion of observed learning history and predict the rest of the entire learning journey. Green area shows the improvement from using our framework.
  • Figure 5: Visualization of question embeddings. For better intuition, we chose the same question from Fig. 2, and, for each of its KCs, we color the question representations sharing the same KC. Evidently, our CL loss is highly effective.
  • ...and 10 more figures