LLM-based Automated Grading with Human-in-the-Loop

Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Jiliang Tang

TL;DR

This work tackles automatic short-answer grading (ASAG) by addressing the limitations of fully automated rubric-based methods through a human-in-the-loop framework, GradeHITL. The approach uses LLMs to grade against an adaptable rubric, while an RL-driven, three-agent system (Retriever, Reflector, Refiner) selects and leverages human-generated Q&A to refine rubrics. By incorporating targeted human feedback and justification via chain-of-thought prompts, GradeHITL achieves superior accuracy and rubric controllability compared with prior methods, demonstrated on a pedagogical dataset of mathematics teaching knowledge. The results suggest that interactive LLM-based grading with selective human input can approach human-level performance in rubric-based ASAG and offers practical benefits for classroom assessment workflows.

Abstract

The rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has brought significant advancements to the field of education. Among various applications, automatic short answer grading (ASAG), which focuses on evaluating open-ended textual responses, has seen remarkable progress with the introduction of LLMs. These models not only enhance grading performance compared to traditional ASAG approaches but also move beyond simple comparisons with predefined "golden" answers, enabling more sophisticated grading scenarios, such as rubric-based evaluation. However, existing LLM-powered methods still face challenges in achieving human-level grading performance in rubric-based assessments due to their reliance on fully automated approaches. In this work, we explore the potential of LLMs in ASAG tasks by leveraging their interactive capabilities through a human-in-the-loop (HITL) approach. Our proposed framework, GradeHITL, utilizes the generative properties of LLMs to pose questions to human experts, incorporating their insights to refine grading rubrics dynamically. This adaptive process significantly improves grading accuracy, outperforming existing methods and bringing ASAG closer to human-level evaluation.
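
The question-then-refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. Every name here (`refine_rubric`, `toy_grader`, `toy_ask_question`, `toy_ask_expert`, `toy_refine`) is hypothetical. The LLM grader and the human expert are replaced by toy stand-ins, and a greedy "keep the refinement only if accuracy improves" rule stands in for the reinforcement-learning-based Q&A selection mentioned in the TL;DR.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]   # (student answer, human-assigned score)
Rubric = List[str]          # toy rubric: phrases that earn credit (an assumption)


def accuracy(grader: Callable[[Rubric, str], int],
             rubric: Rubric, examples: List[Example]) -> float:
    """Fraction of examples where the grader matches the human score."""
    correct = sum(grader(rubric, ans) == score for ans, score in examples)
    return correct / len(examples)


def refine_rubric(rubric, examples, grader, ask_question, ask_expert,
                  refine, rounds=3):
    """Greedy human-in-the-loop refinement (hypothetical stand-in, not RL)."""
    best, best_acc = rubric, accuracy(grader, rubric, examples)
    for _ in range(rounds):
        # Collect the answers the current rubric grades incorrectly.
        errors = [(a, s) for a, s in examples if grader(best, a) != s]
        if not errors:
            break
        question = ask_question(best, errors)        # model poses a question
        answer = ask_expert(question)                # human expert replies
        candidate = refine(best, question, answer)   # fold the reply into the rubric
        cand_acc = accuracy(grader, candidate, examples)
        if cand_acc > best_acc:                      # keep only improvements
            best, best_acc = candidate, cand_acc
    return best, best_acc


# --- Toy stand-ins for the LLM and the human expert (assumptions) ---
def toy_grader(rubric: Rubric, answer: str) -> int:
    return int(any(phrase in answer.lower() for phrase in rubric))


def toy_ask_question(rubric: Rubric, errors: List[Example]) -> str:
    return f"Should an answer like '{errors[0][0]}' receive credit?"


def toy_ask_expert(question: str) -> str:
    return "equal parts"    # canned expert reply, for the demo only


def toy_refine(rubric: Rubric, question: str, answer: str) -> Rubric:
    return rubric + [answer]


examples = [
    ("A fraction splits a whole into equal parts", 1),
    ("Fractions are numbers", 0),
    ("The denominator counts the pieces", 1),
]
final_rubric, final_acc = refine_rubric(
    ["denominator"], examples, toy_grader,
    toy_ask_question, toy_ask_expert, toy_refine,
)
print(final_rubric, final_acc)
```

In this toy run the initial rubric misgrades the first answer, the canned expert reply adds one phrase, and the refined rubric is kept because it grades more of the labelled examples correctly. A real system would replace each toy function with an LLM call or an actual expert interaction.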

Paper Structure

This paper contains 21 sections, 1 equation, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of GradeHITL
  • Figure 2: An example of the prompt to the LLM-based Grader.
  • Figure 3: An example of a question-asking prompt.
  • Figure 4: Illustration of the reinforcement-learning-based Q&A selector.
  • Figure 5: An example of the prompt to the Reflector.
  • ...and 2 more figures