Language Bottleneck Models for Qualitative Knowledge State Modeling

Antonin Berthon, Mihaela van der Schaar

TL;DR

The paper tackles insufficient interpretability in traditional cognitive diagnosis and knowledge tracing by introducing Language Bottleneck Models (LBMs) that compress a student’s interaction history into a textual knowledge-state summary $\tilde{\mathcal{S}}$, which then drives predictions via a decoder. By casting knowledge state modeling as an inverse problem over text, LBMs yield human-readable explanations that can surface nuanced misconceptions while maintaining competitive predictive accuracy and improved sample efficiency. The encoder is trained with reinforcement learning using a decoder-centered reward, while the decoder undergoes supervised fine-tuning; steering mechanisms and ablation studies demonstrate that the textual bottleneck supports targeted pedagogy and robust performance across synthetic and real-world datasets. The approach promises enhanced interpretability and actionable diagnostics in education, with broader applicability to domains where compact textual state representations can forecast future behavior.

Abstract

Accurately assessing student knowledge is central to education. Cognitive Diagnosis (CD) models estimate student proficiency at a fixed point in time, while Knowledge Tracing (KT) methods model evolving knowledge states to predict future performance. However, existing approaches either provide quantitative concept mastery estimates with limited expressivity (CD, probabilistic KT) or prioritize predictive accuracy at the cost of interpretability (deep learning KT). We propose Language Bottleneck Models (LBMs), where an encoder LLM produces textual knowledge state summaries, which a decoder LLM uses to predict future performance. This produces interpretable summaries that can express nuanced insights--such as misconceptions--that CD and KT models cannot capture. Extensive validation across synthetic and real-world datasets shows LBMs reveal qualitative insights beyond what CD and KT models can capture, while achieving competitive accuracy with improved sample efficiency. We demonstrate that the encoder and decoder can be fine-tuned with reinforcement learning and supervised fine-tuning respectively to improve both summary quality and predictive performance.
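
The encoder-summary-decoder flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `Interaction`, `toy_encoder`, `toy_decoder`, and `bottleneck_accuracy` are hypothetical names, and simple rule-based functions stand in for the encoder and decoder LLMs. The one property it tries to preserve is the bottleneck itself: the decoder sees only the text summary, never the raw interaction history.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Interaction:
    question: str
    concept: str   # knowledge concept (KC) the question targets
    correct: bool  # whether the student answered correctly


def toy_encoder(history: List[Interaction]) -> str:
    """Hypothetical stand-in for the encoder LLM: history -> text summary."""
    stats = {}
    for item in history:
        seen, right = stats.get(item.concept, (0, 0))
        stats[item.concept] = (seen + 1, right + int(item.correct))
    sentences = []
    for concept in sorted(stats):
        seen, right = stats[concept]
        level = "has mastered" if right / seen >= 0.5 else "struggles with"
        sentences.append(f"The student {level} {concept}.")
    return " ".join(sentences)


def toy_decoder(summary: str, concept: str) -> bool:
    """Hypothetical stand-in for the decoder LLM: reads only the summary."""
    return f"has mastered {concept}" in summary


def bottleneck_accuracy(
    encoder: Callable[[List[Interaction]], str],
    decoder: Callable[[str, str], bool],
    history: List[Interaction],
    future: List[Interaction],
) -> float:
    """Accuracy of decoder predictions on future questions, given the summary.

    A score of this kind is one plausible form of decoder-centered reward
    for the encoder; the paper's exact reward is not reproduced here.
    """
    summary = encoder(history)  # all information must pass through this text
    hits = sum(decoder(summary, item.concept) == item.correct for item in future)
    return hits / len(future)
```

For example, a history with two correct answers on fractions and two incorrect answers on algebra yields the summary "The student struggles with algebra. The student has mastered fractions.", and the decoder's predictions on future questions are scored against that text alone.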

Paper Structure

This paper contains 112 sections, 4 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Language Bottleneck Models for Knowledge Modeling. (A) Past and future behavior $\mathcal{X}$ and $\mathcal{Y}$ are caused by a certain knowledge state $\mathcal{S}$ held by the student when answering questions. (B) CD and KT models represent the knowledge state via quantitative proficiency vectors or opaque latent embeddings. (C) LBMs approximate the knowledge state using natural language summaries. The example knowledge states shown are taken from the case study presented in the paper's section on qualitative insights.
  • Figure 1: Overview of datasets. AVG#log and STD#log>1 are defined following Wang et al. (2022) as, respectively, the average number of logs per student per KC and the mean standard deviation of scores per student per KC.
  • Figure 2: Accuracy ± SEM (N=200) on Synthetic dataset given ground truth knowledge state summaries.
  • Figure 3: Case study: comparing CD and LBM knowledge states. Given a student from the Synthetic dataset, we compare proficiency estimates across knowledge concepts (KCs) obtained from a trained NeuralCDM model to the text-based knowledge state generated by a trained LBM model.
  • Figure 4: Systematic evaluation of the summaries produced by different encoder LLMs over 200 test students from the Synthetic dataset: (Left) Average construct mastery accuracy across constructs (%) and average overall score (1-5 scale); (Middle) Average misconception detection rate (%) and average number of misconception false positives per summary (count); (Right) Confidence Index (1-3, 1=under-confident, 2=appropriately calibrated, 3=over-confident) and Specificity Index (1-3, higher is more specific). Error bars represent the standard error (N=200).
  • ...and 8 more figures
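
Several captions above report accuracy ± SEM with N=200. As a rough illustration of how such a figure is typically computed, the sketch below takes one score per test unit and returns the mean and its standard error. The function name `mean_with_sem` is hypothetical, and the assumption that the paper aggregates one score per student, with the sample (n-1) standard deviation, is not confirmed by the source.

```python
import math
from typing import Sequence, Tuple


def mean_with_sem(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for a list of scores.

    Assumes independent scores and uses the sample variance (n - 1).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate the SEM")
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)
```

Called on the 200 per-student accuracies of a test set, this would give the two numbers shown as "accuracy ± SEM"; the error bar shrinks in proportion to the square root of N.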