SteLLA: A Structured Grading System Using LLMs with RAG

Hefei Qiu; Brian White; Ashley Ding; Reinaldo Costa; Ali Hachem; Wei Ding; Ping Chen

TL;DR

SteLLA introduces a structured QA-based automatic short-answer grading framework anchored by Retrieval-Augmented Generation (R-RAG), using instructor-provided reference answers and rubrics as a compact knowledge base to ground LLM evaluation. The system decomposes grading into evaluation questions with gold answers, performs LLM-based assessment, and then computes a final grade with breakdown feedback, achieving substantial agreement with human graders on a real biology exam dataset. Key findings show that clustering-based few-shot selection and moderate-shot prompting improve grading performance, while qualitative analyses reveal strengths in factual extraction and areas prone to excessive inference. The work advances ASAG by enabling reliable, explainable, and scalable grading and feedback, with potential extensions to missing rubrics and interactive personalization.
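As a rough illustration of what "substantial agreement" could mean in practice, the sketch below computes Cohen's kappa between a human grader and a model. It is an assumption that agreement is measured this way, since the summary above does not name the metric; "substantial" is the conventional Landis and Koch label for kappa between 0.61 and 0.80. The function name `cohen_kappa` and the marks are made up for illustration and are not taken from the paper's dataset.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical per-knowledge-point marks (1 = credit given, 0 = no credit).
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
model = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1]
kappa = cohen_kappa(human, model)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most answers earn the same mark.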

Abstract

Large Language Models (LLMs) have shown strong general capabilities in many applications. However, making them reliable tools for specific tasks such as automated short answer grading (ASAG) remains a challenge. We present SteLLA (Structured Grading System Using LLMs with RAG), in which a) a Retrieval-Augmented Generation (RAG) approach empowers LLMs specifically for the ASAG task by extracting structured information from highly relevant and reliable external knowledge based on the instructor-provided reference answer and rubric, and b) an LLM performs a structured, question-answering-based evaluation of student answers to provide analytical grades and feedback. A real-world dataset of students' exam answers was collected from a college-level Biology course. Experiments show that the proposed system achieves substantial agreement with the human grader while providing breakdown grades and feedback on all the knowledge points examined in the problem. A qualitative and error analysis of the feedback generated by GPT4 shows that GPT4 is good at capturing facts but may be prone to inferring too much from the given text when grading, which offers insight into the use of LLMs in ASAG systems.
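The structured flow described in the abstract (evaluation questions with gold answers, a per-question judgment, then a final grade with breakdown feedback) might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `EvalQuestion`, `Verdict`, `keyword_judge`, and `grade` are hypothetical, the point values and biology content are invented, and a simple substring check stands in for the LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalQuestion:
    question: str     # evaluation question derived from the rubric
    gold_answer: str  # expected answer drawn from the reference answer
    points: float     # credit attached to this knowledge point (assumed)


@dataclass
class Verdict:
    question: str
    correct: bool
    earned: float
    feedback: str


def keyword_judge(q: EvalQuestion, student_answer: str) -> bool:
    """Stand-in for the LLM judgment: true if the gold answer appears verbatim."""
    return q.gold_answer.lower() in student_answer.lower()


def grade(
    student_answer: str,
    questions: List[EvalQuestion],
    judge: Callable[[EvalQuestion, str], bool] = keyword_judge,
) -> Tuple[float, List[Verdict]]:
    """Judge each evaluation question, then sum the earned points."""
    verdicts = []
    for q in questions:
        ok = judge(q, student_answer)
        note = "Covered." if ok else f"Missing: expected '{q.gold_answer}'."
        verdicts.append(Verdict(q.question, ok, q.points if ok else 0.0, note))
    total = sum(v.earned for v in verdicts)
    return total, verdicts


# Hypothetical rubric items and student answer, for illustration only.
questions = [
    EvalQuestion("Which organelle produces most of the cell's energy?", "mitochondria", 2.0),
    EvalQuestion("Which molecule stores that energy?", "ATP", 1.0),
]
total, verdicts = grade("Energy is made in the mitochondria.", questions)
for v in verdicts:
    print(f"{v.earned:.1f} pts | {v.question} | {v.feedback}")
print("Total:", total)
```

Keeping the judge as a swappable function means the scoring and feedback logic can be checked without a model in the loop; in a real system the judge would wrap a prompted LLM call.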

Paper Structure (21 sections, 4 figures, 2 tables)

This paper contains 21 sections, 4 figures, 2 tables.

Introduction
Background and Related Work
QA-Based Evaluation
Large Language Models (LLMs)
Retrieval-Augmented Generation
Automatic Short-Answer Grading
Method and System Architecture
R-RAG Module
LLM-based Evaluation Module
Scoring Module
Data
Data Source
Privacy Protection
Labeling Process
Characteristics and Statistics
...and 6 more sections

Figures (4)

Figure 1: (a) System architecture of SteLLA consisting of i) R-RAG Module which takes the instructor-provided reference answer and rubrics as inputs, generates and extracts a list of evaluation questions with gold answers, and sends it to the LLM; ii) LLM and QA-based Evaluation Module in which an LLM is prompted to perform grading using QA-based evaluation approach; iii) Scoring Module which generates a final grade and feedback. (b) R-RAG approach (c) Typical RAG approach
Figure 2: An example to show the flow of grading.
Figure 3: The problem, the reference answer, and the rubric in the dataset.
Figure 4: Effect of shot number.
