Language Models are Few-Shot Graders

Chenyan Zhao, Mariana Silva, Seth Poulsen

TL;DR

This paper tackles the challenge of scalable, open-ended student assessment by proposing an LLM-based automatic short answer grading (ASAG) pipeline. It leverages few-shot prompting, retrieval-augmented generation (RAG) for graded-example selection, and grading rubrics to produce interpretable feedback without model fine-tuning, comparing GPT-4, GPT-4o, and o1-preview across multiple benchmarks. Results show that GPT-4o with graded examples and RAG delivers the best balance of accuracy and cost, achieving strong performance on Texas, SAF, and SciEntsBank datasets, while rubric guidance further improves accuracy on a mathematical induction task and enhances item-level interpretability. The work demonstrates practical pathways to deploy scalable, transparent ASAG in classrooms, with future directions including interface integration, customization of grading prompts, and evaluation of feedback quality for learning outcomes.

Abstract

Evaluating student work is a critical component of effective student learning, and automating this process can significantly reduce the workload on human graders. Automatic Short Answer Grading (ASAG) systems, enabled by advancements in Large Language Models (LLMs), offer a promising solution for assessing open-ended student responses and providing instant feedback on them. In this paper, we present an ASAG pipeline leveraging state-of-the-art LLMs. Our new LLM-based ASAG pipeline achieves better performance than existing custom-built models on the same datasets. We also compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview. Our results demonstrate that GPT-4o achieves the best balance between accuracy and cost-effectiveness. On the other hand, o1-preview, despite higher accuracy, exhibits a larger variance in error that makes it less practical for classroom use. We investigate the effects of incorporating instructor-graded examples into prompts under three strategies: no examples, random selection, and Retrieval-Augmented Generation (RAG)-based selection. Our findings indicate that providing graded examples enhances grading accuracy, with RAG-based selection outperforming random selection. Additionally, integrating grading rubrics improves accuracy by offering a structured standard for evaluation.
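The three example-selection strategies named in the abstract (no examples, random selection, retrieval-based selection) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `answer`/`grade` dictionary keys, and the bag-of-words cosine similarity are assumptions made for illustration. A real RAG setup would more likely rank examples by embedding similarity.

```python
import math
import random
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for an embedding (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(graded, answer, k, strategy="similar", seed=0):
    """Pick up to k instructor-graded examples to include in the prompt.

    graded   -- list of dicts with hypothetical keys "answer" and "grade"
    answer   -- the student answer that is about to be graded
    strategy -- "none", "random", or "similar" (retrieval-style selection)
    """
    if strategy == "none" or k <= 0:
        return []
    if strategy == "random":
        return random.Random(seed).sample(graded, min(k, len(graded)))
    if strategy == "similar":
        target = bag_of_words(answer)
        ranked = sorted(
            graded,
            key=lambda ex: cosine(bag_of_words(ex["answer"]), target),
            reverse=True,
        )
        return ranked[:k]
    raise ValueError(f"unknown strategy: {strategy}")
```

With the "similar" strategy, the graded answers whose wording is closest to the new submission are placed in the prompt, which is the intuition behind retrieval-based selection outperforming random selection.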

Paper Structure

This paper contains 12 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Our prompt for grading. The LLM is provided with generic grading instructions, the question that was asked to the students, some example graded solutions, and the solution that we are asking it to provide a grade for. Examples are selected with the strategies explained in the example-selection section.
  • Figure 2: A prompt for grading using a rubric. As in the prompting section, the first component is the same across all courses and datasets.
  • Figure 3: Boxplot showing the distribution of absolute errors for each dataset and grading method. The GPT-4 and GPT-4o models are run on the entire dataset, whereas the o1-preview model is run on 200 randomly selected submissions.
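The prompt layout described in the Figure 1 caption (generic instructions, the question, graded examples, then the submission to grade) could be assembled roughly as follows. This is a sketch under stated assumptions: the function name, the section labels, and the `answer`/`grade` keys are hypothetical, and the actual wording of the authors' prompt is not reproduced here.

```python
def build_grading_prompt(instructions, question, examples, submission):
    """Assemble a few-shot grading prompt from the four components.

    instructions -- generic grading instructions shared across courses
    question     -- the question that was asked to the students
    examples     -- graded examples, dicts with hypothetical keys
                    "answer" and "grade" (may be an empty list)
    submission   -- the student answer to be graded
    """
    parts = [instructions.strip(), "", f"Question: {question.strip()}"]
    for i, ex in enumerate(examples, start=1):
        parts += [
            "",
            f"Example {i}",
            f"Answer: {ex['answer']}",
            f"Grade: {ex['grade']}",
        ]
    # The prompt ends with an open "Grade:" slot for the model to fill in.
    parts += ["", "Answer to grade:", submission.strip(), "Grade:"]
    return "\n".join(parts)
```

Passing an empty `examples` list yields the no-example variant of the prompt, so the same function covers all three selection strategies.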