Enhancing Multi-Domain Automatic Short Answer Grading through an Explainable Neuro-Symbolic Pipeline

Felix Künnecke, Anna Filighera, Colin Leong, Tim Steuer

TL;DR

This work contributes a weakly supervised annotation procedure for justification cues in automatic short answer grading (ASAG) datasets, and a neuro-symbolic model for explainable ASAG based on those cues. Together they offer a promising direction for future research in ASAG and educational NLP on generating high-quality grades with accompanying explanations.

Abstract

Automatically grading short answer questions while providing interpretable reasoning for the grading decision is a challenging goal for current transformer approaches. Justification cue detection, combined with logical reasoners, has emerged as a promising direction for neuro-symbolic architectures in ASAG. However, a main challenge is that this requires annotated justification cues in the students' responses, which exist for only a few ASAG datasets. To overcome this challenge, we contribute (1) a weakly supervised annotation procedure for justification cues in ASAG datasets, and (2) a neuro-symbolic model for explainable ASAG based on justification cues. Our approach improves the RMSE by 0.24 to 0.3 over the state of the art on the Short Answer Feedback dataset in a bilingual, multi-domain, and multi-question training setup. This result suggests that our approach is a promising direction for future research in ASAG and educational NLP on generating high-quality grades with accompanying explanations.

Paper Structure (23 sections, 4 figures, 6 tables)

This paper contains 23 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Schematic visualization of our approach with an exemplary student answer. The yellow phrases are the recognized justification cues, matched based on their similarity to the scoring rubric. The resulting scoring vector is fed into our symbolic grading models, which predict the final score. This allows our model to report the actual justification cues it used for the final grading decision, together with their similarity to the scoring rubric.
  • Figure 2: Schematic visualization of the full pipeline of our approach. The pipeline contains three stages: (1) Weak Supervision: annotates the ASAG corpus with silver labels. (2) Justification Cue Detection: transformer model trained on the silver labels for finding justification cues in the student answers. (3) Grading: a symbolic model that uses the extracted justification cues for grading based on the similarity to the respective scoring rubric.
  • Figure 3: Visualization of the grading process. The underlying justification cue model retrieves the respective student's answer and context and detects all justification cues. The justification cues are then matched against the scoring rubric to generate a scoring vector, which is fed into the symbolic grading model. To give feedback on the grading model's prediction, we calculate the loss $L(\hat{Y}, Y)$ and backpropagate it to the justification cue model.
  • Figure 4: Demonstration of our word-token-based metrics on an example answer to the question "What is the difference between asynchronous and synchronous transmission mode in the Data Link Layer?" The predicted justification cues are highlighted in yellow. Number of justification cues: 2; average number of tokens per justification cue: 10; percentage of justification cue tokens: 0.345.
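
The three word-token metrics named in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cue_metrics`, the half-open span representation, the span positions, and the total of 58 answer tokens are all assumptions made for illustration. The total is chosen only because 20 cue tokens out of 58 reproduces the reported value of 0.345, which the sketch treats as a fraction rather than a percentage.

```python
from typing import List, Tuple


def cue_metrics(num_tokens: int, cue_spans: List[Tuple[int, int]]) -> Tuple[int, float, float]:
    """Compute word-token metrics for predicted justification cues.

    num_tokens: total number of word tokens in the student answer.
    cue_spans:  half-open (start, end) token index pairs, one per cue,
                assumed to be non-overlapping (hypothetical representation).
    Returns (number of cues, average tokens per cue, fraction of cue tokens).
    """
    num_cues = len(cue_spans)
    cue_tokens = sum(end - start for start, end in cue_spans)
    avg_tokens_per_cue = cue_tokens / num_cues if num_cues else 0.0
    cue_token_fraction = cue_tokens / num_tokens if num_tokens else 0.0
    return num_cues, avg_tokens_per_cue, cue_token_fraction


# Hypothetical example: two cues of 10 tokens each in a 58-token answer.
example_spans = [(5, 15), (30, 40)]
num_cues, avg_tokens, fraction = cue_metrics(58, example_spans)
print(num_cues, avg_tokens, round(fraction, 3))  # 2 10.0 0.345
```

Under these assumed inputs the sketch matches the three values in the caption; the real answer length and cue positions are not given in the text above.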