Knowledge Distillation of LLM for Automatic Scoring of Science Education Assessments

Ehsan Latif, Luyang Fang, Ping Ma, Xiaoming Zhai

TL;DR

The paper tackles automatic scoring in science education by distilling knowledge from a fine-tuned LLM (teacher) into a compact student suitable for resource-constrained devices. It introduces a teacher–student KD framework where the student minimizes a KD loss $\mathcal{L}^{\mathrm{KD}} = \mathcal{L} + \lambda \tilde{\mathcal{L}}$ with $\tilde{\mathcal{L}} = \frac{1}{N} \sum_i \mathrm{CE}(\boldsymbol p_i, f(\mathbf{x}_i;\boldsymbol{\theta}))$ using soft labels $\boldsymbol p_i$ from the teacher. The experimental evaluation on the 7T SciEd dataset and three mathematical reasoning tasks shows that the distilled student (0.03M parameters) achieves accuracy and F1 scores near the teacher while outperforming TinyBERT and ANN; it is also about 4,000× smaller and 10× faster in inference. This work demonstrates a practical path to deploying advanced AI for automatic scoring on typical school devices and points to future improvements in soft-label and prompt-processing techniques to further reduce teacher-related faults.
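
The KD loss above can be sketched numerically. This is a minimal illustration, not the authors' implementation: it assumes $\mathcal{L}$ is the usual cross-entropy against the human-assigned scores (this summary does not define it), and the function names (`softmax`, `cross_entropy`, `kd_loss`), the value of `lam`, and the toy inputs are all hypothetical.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(target, pred, eps=1e-12):
    # Mean over samples of CE(target_i, pred_i) = -sum_k target_ik * log(pred_ik).
    return float(-(target * np.log(pred + eps)).sum(axis=1).mean())

def kd_loss(student_logits, hard_labels, teacher_probs, lam=1.0):
    # student_logits: (N, K) raw outputs of the student f(x; theta)
    # hard_labels:    (N,) integer class labels (ASSUMED to be the human scores)
    # teacher_probs:  (N, K) soft labels p_i produced by the teacher
    q = softmax(student_logits)
    one_hot = np.eye(q.shape[1])[hard_labels]
    hard = cross_entropy(one_hot, q)        # assumed form of L
    soft = cross_entropy(teacher_probs, q)  # L-tilde from the formula above
    return hard + lam * soft, hard, soft

# Toy batch: 2 responses, 3 score classes (made-up numbers).
student_logits = np.zeros((2, 3))           # student predicts uniform probabilities
hard_labels = np.array([0, 2])
teacher_probs = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.3, 0.6]])
total, hard, soft = kd_loss(student_logits, hard_labels, teacher_probs, lam=0.5)
print(total, hard, soft)
```

Setting `lam=0` recovers ordinary supervised training on the hard labels, while larger values push the student toward the teacher's full probability distribution rather than only its top prediction.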

Abstract

This study proposes a method for knowledge distillation (KD) of fine-tuned Large Language Models (LLMs) into smaller, more efficient, and accurate neural networks. We specifically target the challenge of deploying these models on resource-constrained devices. Our methodology involves training the smaller student model (a neural network) on the prediction probabilities of the LLM, which serves as the teacher model and supplies these probabilities as soft labels. This is achieved through a specialized loss function tailored to learn from the LLM's output probabilities, ensuring that the student model closely mimics the teacher's performance. To validate the KD approach, we used a large dataset, 7T, containing 6,684 student-written responses to science questions, together with three mathematical reasoning datasets of student-written responses graded by human experts. We compared accuracy with state-of-the-art (SOTA) distilled models, TinyBERT, and artificial neural network (ANN) models. Results show that the KD approach achieves 3% and 2% higher scoring accuracy than ANN and TinyBERT, respectively, and accuracy comparable to the teacher model. Furthermore, the student model has 0.03M parameters; it is 4,000 times smaller in parameter count and 10 times faster at inference than the teacher model and TinyBERT, respectively. The significance of this research lies in its potential to make advanced AI technologies accessible in typical educational settings, particularly for automatic scoring.

Paper Structure (12 sections, 2 equations, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Architecture of the proposed KD approach. The teacher model's prediction probabilities serve as soft labels, and the student model is trained to match them through the loss function.