Grading Massive Open Online Courses Using Large Language Models

Shahriar Golchin; Nikhil Garuda; Christopher Impey; Matthew Wenger

TL;DR

The paper investigates replacing MOOC peer grading with large language models, using zero-shot chain-of-thought (ZCoT) prompting across three courses and two models (GPT-4 and GPT-3.5). It introduces three prompting variants: ZCoT with instructor-provided correct answers, ZCoT with instructor-provided correct answers and rubrics, and ZCoT with instructor-provided correct answers and LLM-generated rubrics. Alignment with instructor grades is evaluated using bootstrap resampling ($p$-value threshold $0.05$) and mean absolute error (MAE). Results show that GPT-4 paired with ZCoT and instructor rubrics most closely matches instructor grades and often outperforms peer grading, especially in courses with well-defined rubrics; imaginative or speculative domains remain more challenging for both humans and LLMs. The findings indicate substantial potential for scalable automated grading in MOOCs, while highlighting limitations in open-ended domains and the need to address ethical concerns such as fairness, transparency, and student perceptions of machine feedback. Overall, the approach offers a path toward reliable, scalable, and feedback-rich assessment for millions of online learners.
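The evaluation mentioned above can be illustrated with a minimal sketch. This page does not spell out the paper's exact procedure, so what follows is one common formulation and an assumption: MAE between instructor and LLM grades, plus a paired bootstrap test on the mean grade difference. The function names, the number of resamples, and the null-centering step are all hypothetical choices made for illustration.

```python
import random


def mean_absolute_error(reference, predicted):
    """Average absolute gap between reference (e.g. instructor) and predicted grades."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("grade lists must be non-empty and of equal length")
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


def bootstrap_p_value(reference, predicted, n_resamples=10_000, seed=0):
    """Two-sided paired bootstrap test on the mean grade difference.

    Assumption: differences are centered to simulate the null hypothesis of
    no mean difference, then resampled with replacement. The paper's own
    procedure may differ.
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("grade lists must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [p - r for r, p in zip(reference, predicted)]
    n = len(diffs)
    observed = sum(diffs) / n
    centered = [d - observed for d in diffs]  # enforce a zero-mean null
    extreme = 0
    for _ in range(n_resamples):
        sample_mean = sum(rng.choice(centered) for _ in range(n)) / n
        if abs(sample_mean) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_resamples + 1)


if __name__ == "__main__":
    instructor = [6, 7, 8, 5, 9, 6]   # made-up grades, for illustration only
    llm = [6, 8, 7, 5, 9, 7]
    print("MAE:", mean_absolute_error(instructor, llm))
    print("p-value:", bootstrap_p_value(instructor, llm, n_resamples=2_000))
```

Under this reading, a $p$-value above the $0.05$ threshold means the test finds no significant difference between the two sets of grades, which is the outcome one would hope for when comparing LLM grades to instructor grades.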

Abstract

Massive open online courses (MOOCs) offer free education globally. Despite this democratization of learning, the massive enrollment in these courses makes it impractical for an instructor to assess every student's writing assignment. As a result, peer grading, often guided by a straightforward rubric, is the method of choice. While convenient, peer grading often falls short in terms of reliability and validity. In this study, we explore the feasibility of using large language models (LLMs) to replace peer grading in MOOCs. To this end, we adapt the zero-shot chain-of-thought (ZCoT) prompting technique to automate the feedback process once the LLM assigns a score to an assignment. Specifically, to instruct LLMs for grading, we use three distinct prompts based on ZCoT: (1) ZCoT with instructor-provided correct answers, (2) ZCoT with both instructor-provided correct answers and rubrics, and (3) ZCoT with instructor-provided correct answers and LLM-generated rubrics. We tested these prompts in 18 different scenarios using two LLMs, GPT-4 and GPT-3.5, across three MOOCs: Introductory Astronomy, Astrobiology, and the History and Philosophy of Astronomy. Our results show that ZCoT, when augmented with instructor-provided correct answers and rubrics, produces grades that are more aligned with those assigned by instructors compared to peer grading. Finally, our findings indicate a promising potential for automated grading systems in MOOCs, especially in subjects with well-defined rubrics, to improve the learning experience for millions of online learners worldwide.

Paper Structure (16 sections, 3 figures, 5 tables)

Introduction
Related Work
Approach
Prompts
Evaluation of LLM-Assigned Grades
Baseline
Experimental Setup
Results and Discussion
Qualitative Analysis: Average Grades Evaluation
Quantitative Analysis: Question-by-Question Evaluation
Conclusion
Limitations
Ethical Considerations
Details on Assignment Questions
Mean Absolute Error Results
...and 1 more sections

Figures (3)

Figure 1: An illustration of the ZCoT prompt along with answers provided by the course instructor. Each question is assessed individually for every student. We repeat this process for all questions and students, incorporating their answers into the prompt, and instructing the LLM to grade the assignments. In this example, the instructor-assigned grade is 6/9, with GPT-4 serving as the underlying LLM.
Figure 2: An illustration of the ZCoT prompt that incorporates both instructor-provided correct answers and rubrics for grading assignments. The grading process used in ZCoT with correct answers only (Figure 1) is also applied here. As in Figure 1, the instructor-assigned grade for this question is 6/9, and GPT-4 is the base model. As shown, including rubrics in the prompt helps the LLM generate a grade consistent with the grade assigned by the instructor.
Figure 3: An illustration of the prompt used to generate rubrics with GPT-4 for the Astrobiology course. This procedure is repeated for all courses under study, substituting the course name, correct answers, total grades, and questions accordingly. The generated rubrics are then integrated into the ZCoT prompt along with the correct answers for assignment grading. Specifically, the prompt shown in Figure 2 is employed for grading, with the instructor-provided rubrics replaced by rubrics generated by the LLM.
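The prompt assembly described in the captions above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's prompt: the function name `build_zcot_prompt`, its parameters, and all of the wording are assumptions. The closing phrase "Let's think step by step." is the standard zero-shot chain-of-thought trigger, and whether the paper uses that exact phrase is not stated on this page. No model call is made; the sketch only builds the text that would be sent to an LLM, once per question per student.

```python
def build_zcot_prompt(question, correct_answer, student_answer, max_points, rubric=None):
    """Assemble a grading prompt for one question and one student (hypothetical wording).

    rubric=None          -> ZCoT with correct answers only (as in Figure 1)
    rubric=<text>        -> ZCoT with correct answers and a rubric (as in Figure 2);
                            the rubric may be instructor-provided or LLM-generated.
    """
    parts = [
        "You are grading one answer from a MOOC writing assignment.",
        f"Question: {question}",
        f"Correct answer (provided by the instructor): {correct_answer}",
    ]
    if rubric is not None:
        parts.append(f"Rubric: {rubric}")
    parts.extend([
        f"Student answer: {student_answer}",
        f"Assign a grade out of {max_points} and explain your reasoning.",
        "Let's think step by step.",  # zero-shot chain-of-thought trigger
    ])
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Placeholder strings only; no real course content is reproduced here.
    prompt = build_zcot_prompt(
        question="<question text>",
        correct_answer="<instructor answer>",
        student_answer="<student answer>",
        max_points=9,
        rubric="<rubric text>",
    )
    print(prompt)
```

Keeping the rubric optional lets a single function cover all three prompt variants: omit it for the answers-only setting, or pass either the instructor's rubric or one generated beforehand by the LLM.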
