LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

Grace Byun, Swati Rajwal, Jinho D. Choi

TL;DR

This work evaluates GPT-4o as an automatic grader for short-answer quizzes and team project reports in a real undergraduate Computational Linguistics course, comparing LLM scores to TA evaluations. The authors implement a Python-based autograding pipeline and two prompting configurations to produce fine-grained scores with justifications, and they release an open-source LLM-as-a-Grader toolkit along with sample data. Results show strong alignment with human graders, with correlations up to 0.98 for quizzes and largely similar section-level scores for projects, though GPT-4o tends to be more conservative on technical sections. The study demonstrates the practicality and scalability of LLM-based grading in real classrooms and discusses limitations and directions for future work, including multimodal and multilingual grading and educational impact.

Abstract

Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
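
The two agreement measures quoted in the abstract (correlation with human graders and the share of exactly matching scores) can be illustrated with a minimal sketch. The abstract does not say which correlation coefficient is reported, so Pearson's r is an assumption here, and the function names and sample scores are hypothetical rather than taken from the released toolkit.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumption: the source only says "correlation"; Pearson's r is
    used here purely for illustration.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def exact_agreement(xs, ys):
    """Fraction of items where both graders gave the identical score."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    return sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)


# Hypothetical scores for five quiz answers (not data from the paper).
ta_scores = [2, 1, 0, 2, 1]
llm_scores = [2, 1, 0, 1, 1]

r = pearson(ta_scores, llm_scores)
agreement = exact_agreement(ta_scores, llm_scores)
print(f"correlation={r:.3f}, exact agreement={agreement:.0%}")
```

The two measures answer different questions: a high correlation only says that the LLM ranks answers similarly to the TAs, whereas exact agreement requires the same number of points on each answer. That is how a correlation near 0.98 can coexist with exact agreement in only about half of the cases.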

Paper Structure

This paper contains 15 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Our toolkit evaluates both short-answer quiz responses (left) and reports (right). For quizzes, a student answer is compared to a reference answer and scored based on correctness. For reports, text is extracted from PDF files and evaluated against a pre-defined rubric. In both cases, an explanation is generated to justify the score.
  • Figure 2: Prompt used to grade quiz responses. full_score refers to the maximum points for the question. valid_scores lists all possible scores the grader can assign.
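
The caption of Figure 2 names two prompt parameters, full_score and valid_scores. The sketch below shows one way such a prompt might be assembled and how a returned score could be checked against the allowed values. The template wording and the helper names build_quiz_prompt and check_score are hypothetical; the authors' actual prompt is the one shown in Figure 2 and is not reproduced here. No model API is called.

```python
def build_quiz_prompt(question, reference_answer, student_answer,
                      full_score, valid_scores):
    """Assemble a grading prompt (hypothetical wording, not the paper's).

    full_score   -- maximum points for the question
    valid_scores -- every score the grader is allowed to assign
    """
    allowed = ", ".join(str(s) for s in valid_scores)
    return (
        "You are grading a short-answer quiz question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {student_answer}\n"
        f"Maximum points: {full_score}\n"
        f"Assign exactly one of these scores: {allowed}\n"
        "Give the score and a brief justification."
    )


def check_score(score, valid_scores):
    """Reject any score that is not in the list of allowed values."""
    if score not in valid_scores:
        raise ValueError(f"score {score!r} is not one of {valid_scores}")
    return score


# Hypothetical question worth 2 points, gradable in whole points.
prompt = build_quiz_prompt(
    question="What is tokenization?",
    reference_answer="Splitting text into units such as words or subwords.",
    student_answer="Breaking a sentence into words.",
    full_score=2,
    valid_scores=[0, 1, 2],
)
print(prompt)
```

Listing the allowed scores explicitly, and validating the returned value afterwards, guards against a model reply that falls outside the grading scale.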