Language Models are Few-Shot Graders

Chenyan Zhao; Mariana Silva; Seth Poulsen

TL;DR

This paper tackles the challenge of scalable, open-ended student assessment by proposing an LLM-based automatic short answer grading (ASAG) pipeline. It leverages few-shot prompting, retrieval-augmented generation (RAG) for graded-example selection, and grading rubrics to produce interpretable feedback without model fine-tuning, comparing GPT-4, GPT-4o, and o1-preview across multiple benchmarks. Results show that GPT-4o with graded examples and RAG delivers the best balance of accuracy and cost, achieving strong performance on Texas, SAF, and SciEntsBank datasets, while rubric guidance further improves accuracy on a mathematical induction task and enhances item-level interpretability. The work demonstrates practical pathways to deploy scalable, transparent ASAG in classrooms, with future directions including interface integration, customization of grading prompts, and evaluation of feedback quality for learning outcomes.

Abstract

Evaluating student work is a critical component of effective learning, and automating this process can significantly reduce the workload on human graders. Automatic Short Answer Grading (ASAG) systems, enabled by advances in Large Language Models (LLMs), offer a promising way to assess open-ended student responses and give instant feedback on them. In this paper, we present an ASAG pipeline that leverages state-of-the-art LLMs. Our LLM-based ASAG pipeline achieves better performance than existing custom-built models on the same datasets. We also compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview. Our results demonstrate that GPT-4o achieves the best balance between accuracy and cost-effectiveness. In contrast, o1-preview, despite its higher accuracy, exhibits a larger variance in error, which makes it less practical for classroom use. We investigate the effects of incorporating instructor-graded examples into prompts under three strategies: no examples, random selection, and Retrieval-Augmented Generation (RAG)-based selection. Our findings indicate that providing graded examples enhances grading accuracy, with RAG-based selection outperforming random selection. Additionally, integrating grading rubrics improves accuracy by offering a structured standard for evaluation.
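As a rough illustration of the RAG-based example selection mentioned in the abstract, the sketch below retrieves the instructor-graded answers most similar to a new student response and places them in a few-shot grading prompt alongside a rubric. It is a minimal sketch under stated assumptions, not the authors' implementation: the bag-of-words cosine similarity stands in for whatever embedding model the pipeline actually uses, and the function names, data layout, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    # Assumption: word counts stand in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(response, graded_pool, k=2):
    # Retrieval step: rank instructor-graded answers by similarity to the
    # new response and keep the top k.
    query = bag_of_words(response)
    ranked = sorted(
        graded_pool,
        key=lambda ex: cosine(query, bag_of_words(ex["answer"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, rubric, examples, response):
    # Hypothetical prompt layout: rubric, graded examples, then the new answer.
    lines = [f"Question: {question}", f"Rubric: {rubric}", ""]
    for ex in examples:
        lines.append(f"Example answer: {ex['answer']}")
        lines.append(f"Grade: {ex['grade']}")
        lines.append("")
    lines.append(f"Student answer: {response}")
    lines.append("Grade:")
    return "\n".join(lines)
```

The resulting prompt string would then be sent to the chosen model. Replacing `select_examples` with a random draw from the pool, or with an empty list, gives the random-selection and no-example conditions that the abstract compares against.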
