Marking: Visual Grading with Highlighting Errors and Annotating Missing Bits

Shashank Sonkar, Naiming Liu, Debshila B. Mallick, Richard G. Baraniuk

TL;DR

This work introduces Marking, a fine-grained automated grading task that highlights the correct, incorrect, and irrelevant portions of a student response and detects omissions relative to a gold answer, framing the task as an extension of NLI. It presents BioMarking, a biology-focused dataset curated by subject-matter experts, and reports baseline performance for transformer models (BERT and RoBERTa) with pretraining on e-SNLI and novel preprocessing steps such as Dual Instance Pairing and stopword removal. The study defines a per-token labeling scheme with three label settings (Generic, Contradiction-focused, and Error-focused) and shows that RoBERTa-large, particularly under the Error-focused setting, achieves the strongest performance, illustrating both the promise and the difficulty of nuanced feedback in AI-based assessment. Overall, Marking offers a scalable feedback mechanism for AI-powered education that signals both what students got right or wrong and which essential concepts they missed, supported by open-source code and benchmarks for future work.

Abstract

In this paper, we introduce "Marking", a novel grading task that enhances automated grading systems by performing an in-depth analysis of student responses and providing students with visual highlights. Unlike traditional systems that provide binary scores, "Marking" identifies and categorizes segments of the student response as correct, incorrect, or irrelevant and detects omissions from gold answers. We introduce a new dataset meticulously curated by Subject Matter Experts specifically for this task. We frame "Marking" as an extension of the Natural Language Inference (NLI) task, which is extensively explored in the field of Natural Language Processing. The gold answer and the student response play the roles of premise and hypothesis in NLI, respectively. We subsequently train language models to identify entailment, contradiction, and neutrality in the student response, akin to NLI, with the added dimension of identifying omissions from gold answers. Our experimental setup involves the use of transformer models, specifically BERT and RoBERTa, and an intelligent training step using the e-SNLI dataset. We present extensive baseline results highlighting the complexity of the "Marking" task, which sets a clear trajectory for future studies. Our work not only opens up new avenues for research in AI-powered educational assessment tools, but also provides a valuable benchmark for the AI in education community to engage with and improve upon in the future. The code and dataset can be found at https://github.com/luffycodes/marking.
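
To make the premise/hypothesis framing concrete, the following is a minimal Python sketch of how one Marking instance might be represented and turned into highlight spans. It is an illustration under stated assumptions, not the authors' implementation: the class `MarkingExample`, the helper `group_spans`, the label strings, and the toy biology sentence are all hypothetical and are not taken from the BioMarking dataset or the released code.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple

# Hypothetical label names chosen for this sketch; the paper's label
# settings (Generic, Contradiction-focused, Error-focused) may differ.
ENTAILMENT, CONTRADICTION, NEUTRAL = "entailment", "contradiction", "neutral"
PRESENT, OMITTED = "present", "omitted"


@dataclass
class MarkingExample:
    """One gold answer (premise) paired with one student response (hypothesis)."""
    gold_tokens: List[str]
    gold_labels: List[str]      # per-token: PRESENT or OMITTED
    student_tokens: List[str]
    student_labels: List[str]   # per-token: ENTAILMENT, CONTRADICTION, or NEUTRAL


def group_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share a label into (label, text) spans."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    spans = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        spans.append((label, " ".join(token for token, _ in group)))
    return spans


# Toy instance invented for illustration only.
example = MarkingExample(
    gold_tokens=["Mitochondria", "produce", "ATP", "through", "cellular", "respiration"],
    gold_labels=[PRESENT, PRESENT, OMITTED, OMITTED, OMITTED, OMITTED],
    student_tokens=["Mitochondria", "produce", "glucose", "which", "is", "cool"],
    student_labels=[ENTAILMENT, ENTAILMENT, CONTRADICTION, NEUTRAL, NEUTRAL, NEUTRAL],
)

student_spans = group_spans(example.student_tokens, example.student_labels)
gold_spans = group_spans(example.gold_tokens, example.gold_labels)

for label, text in student_spans:
    print(f"student [{label}]: {text}")
for label, text in gold_spans:
    print(f"gold    [{label}]: {text}")
```

In this sketch the student-side spans correspond to the correct, incorrect, and irrelevant highlights, while the gold-side spans labeled `omitted` correspond to the missing content that would be annotated for the student. A trained token-classification model would supply the label lists; here they are written by hand.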

Paper Structure

This paper contains 21 sections, 5 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: An illustration of the "Marking" task, which is formulated as NLI where the "gold answer" represents the premise and the "student response" is the hypothesis. The correct parts of the student response are classified as entailment, the incorrect parts as contradiction, and the irrelevant parts as neutral.