Automated Long Answer Grading with RiceChem Dataset

Shashank Sonkar; Kangqi Ni; Lesa Tran Lu; Kristi Kincaid; John S. Hutchinson; Richard G. Baraniuk

TL;DR

This work tackles Automated Long Answer Grading (ALAG), a task requiring fine-grained, fact-based evaluation of lengthy student responses. It introduces RiceChem, a dataset of 1264 chemistry responses annotated on 27 rubric items, enabling per-item scoring and interpretable feedback. The core contribution is formulating ALAG as a rubric entailment problem, leveraging Natural Language Inference (NLI) and transfer learning from MNLI to improve grading, and evaluating this against traditional score-based baselines and Large Language Models. Results show rubric-based entailment methods outperform score-based baselines and that MNLI transfer provides measurable gains, yet ALAG remains more challenging for LLMs than Automated Short Answer Grading tasks, highlighting the need for specialized models and data-efficient deployment strategies in education.

Abstract

We introduce a new area of study in the field of educational Natural Language Processing: Automated Long Answer Grading (ALAG). Distinguishing itself from Automated Short Answer Grading (ASAG) and Automated Essay Grading (AEG), ALAG presents unique challenges due to the complexity and multifaceted nature of fact-based long answers. To study ALAG, we introduce RiceChem, a dataset derived from a college chemistry course, featuring real student responses to long-answer questions with an average word count notably higher than typical ASAG datasets. We propose a novel approach to ALAG by formulating it as a rubric entailment problem, employing natural language inference models to verify whether each criterion, represented by a rubric item, is addressed in the student's response. This formulation enables the effective use of MNLI for transfer learning, significantly improving the performance of models on the RiceChem dataset. We demonstrate the importance of rubric-based formulation in ALAG, showcasing its superiority over traditional score-based approaches in capturing the nuances of student responses. We also investigate the performance of models in cold start scenarios, providing valuable insights into the practical deployment considerations in educational settings. Lastly, we benchmark state-of-the-art open-source Large Language Models (LLMs) on RiceChem and compare their results to GPT models, highlighting the increased complexity of ALAG compared to ASAG. Despite leveraging the benefits of a rubric-based approach and transfer learning from MNLI, the lower performance of LLMs on RiceChem underscores the significant difficulty posed by the ALAG task. With this work, we offer a fresh perspective on grading long, fact-based answers and introduce a new dataset to stimulate further research in this important area. Code: https://github.com/luffycodes/Automated-Long-Answer-Grading
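The rubric entailment formulation described in the abstract can be illustrated with a minimal sketch. It pairs one student response (the premise) with each rubric item (the hypothesis) and sums the entailed items into a score. All names here (`NLIPair`, `build_pairs`, `grade`, `keyword_entails`) are hypothetical and not taken from the paper's code, and `keyword_entails` is a toy stand-in for a fine-tuned NLI model, included only so the example runs without external dependencies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class NLIPair:
    premise: str     # the student response
    hypothesis: str  # one rubric item


def build_pairs(response: str, rubric_items: List[str]) -> List[NLIPair]:
    """Create one (premise, hypothesis) pair per rubric item."""
    return [NLIPair(response, item) for item in rubric_items]


def grade(
    response: str,
    rubric_items: List[str],
    entails: Callable[[str, str], bool],
) -> Tuple[Dict[str, bool], int]:
    """Return per-item decisions and the number of rubric items satisfied.

    `entails` is an assumption: any callable mapping (premise, hypothesis)
    to True/False, e.g. a wrapper around a fine-tuned NLI classifier.
    """
    per_item = {
        pair.hypothesis: entails(pair.premise, pair.hypothesis)
        for pair in build_pairs(response, rubric_items)
    }
    return per_item, sum(per_item.values())


def keyword_entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an NLI model: every hypothesis word occurs in the premise."""
    text = premise.lower()
    return all(word in text for word in hypothesis.lower().split())


if __name__ == "__main__":
    response = (
        "Ionization energy increases across a period because "
        "nuclear charge increases."
    )
    rubric = ["nuclear charge increases", "shielding stays constant"]
    decisions, score = grade(response, rubric, keyword_entails)
    for item, ok in decisions.items():
        print(f"{'met' if ok else 'missed'}: {item}")
    print("score:", score)
```

Because each rubric item gets its own decision, the per-item dictionary doubles as feedback for the student, which is the interpretability property the rubric-based formulation is meant to provide.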

Paper Structure (13 sections, 4 figures, 4 tables)

This paper contains 13 sections, 4 figures, 4 tables.

Introduction
Related Work: ALAG vs ASAG/AEG
Dataset and Method
RiceChem Dataset
Automated Long Answer Grading (ALAG)
Experiments
Experimental Setup
Benchmarking on Discriminative Models
The Value of Entailment Formulation in ALAG
The Importance of Rubric-based Formulation in ALAG
Benchmarking on Cold Start Scenarios
Benchmarking on Large Language Models
Conclusion

Figures (4)

Figure 1: Schematic illustration of Automated Long Answer Grading (ALAG) using the RiceChem dataset. The figure highlights our novel approach of formulating ALAG as a rubric entailment problem, where each student response (premise) is paired with a corresponding rubric item (hypothesis). These pairs are then processed by a fine-tuned ALAG-transformer model, which predicts whether the response entails the rubric item. The use of rubrics in RiceChem allows for a detailed, point-by-point evaluation, making the grading process interpretable by design.
Figure 2: An example from our RiceChem dataset showing a question, rubric items, and a student response. Underlined rubric items have been correctly answered by the student.
Figure 3: Comparison between the traditional score-based grading approach and the rubric-based ALAG approach on the RiceChem dataset. Rubric-based ALAG yields an average increase of 9.2% in accuracy and 15.4% in F1 score, suggesting that breaking grading down into rubric items lets models focus on smaller parts of the task rather than the whole task at once. The improvement holds across all models, regardless of their number of parameters.
Figure 4: Performance of RoBERTa-Large and RoBERTa-Large-MNLI models with varying amounts of training data, ranging from 5% to 80%. The models show consistent improvement in accuracy and F1 score as more labeled data becomes available, with diminishing returns after 40% for RoBERTa-Large and 20% for RoBERTa-Large-MNLI.
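The cold-start setup in Figure 4 amounts to training on progressively larger fractions of the labeled data and tracking accuracy and F1. The sketch below shows only the subsampling and metric bookkeeping; it trains no model. The helper names (`subsample`, `accuracy_f1`), the fixed seed, and the intermediate fraction 10% are assumptions (the caption gives only the 5% to 80% range and mentions 20% and 40%), and binary F1 on the positive class is assumed rather than confirmed by the source.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def subsample(examples: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Draw a reproducible random subset containing `fraction` of the examples."""
    rng = random.Random(seed)
    k = max(1, round(len(examples) * fraction))
    return rng.sample(list(examples), k)


def accuracy_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Accuracy and binary F1 (positive class = 1) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    train_pairs = list(range(1000))  # placeholder for (response, rubric item) pairs
    for fraction in (0.05, 0.10, 0.20, 0.40, 0.80):
        subset = subsample(train_pairs, fraction)
        # A real experiment would fine-tune a model on `subset` here
        # and evaluate it on a fixed held-out test set.
        print(f"{fraction:.0%} of training data -> {len(subset)} examples")
```

Fixing the seed keeps the subsets comparable across models, so differences between the two curves reflect the models rather than the sampled data.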
