StepGrade: Grading Programming Assignments with Context-Aware LLMs

Mohammad Akyash, Kimia Zamiri Azar, Hadi Mardani Kamali

TL;DR

This paper addresses the challenge of scalable, interpretable grading of programming assignments in large courses by introducing StepGrade, a framework that uses Chain-of-Thought prompting with large language models to reason across multiple grading criteria. By decomposing evaluation into functional correctness, code quality, and algorithmic efficiency, StepGrade provides transparent, stepwise feedback that mirrors human reasoning. Empirical results on 30 Python submissions across Easy, Intermediate, and Advanced tasks show CoT prompting yields closer alignment with human graders and delivers more actionable feedback than regular prompting. The work demonstrates the potential for GPT-4–driven, context-aware grading to reduce educator workload while enhancing pedagogical value in programming education, including MOOCs.

Abstract

Grading programming assignments is a labor-intensive and time-consuming process that demands careful evaluation across multiple dimensions of the code. To overcome these challenges, automated grading systems are leveraged to enhance efficiency and reduce the workload on educators. Traditional automated grading systems often focus solely on correctness, failing to provide interpretable evaluations or actionable feedback for students. This study introduces StepGrade, which explores the use of Chain-of-Thought (CoT) prompting with Large Language Models (LLMs) as an innovative solution to address these challenges. Unlike regular prompting, which offers limited and surface-level outputs, CoT prompting allows the model to reason step-by-step through the interconnected grading criteria, namely functionality, code quality, and algorithmic efficiency, ensuring a more comprehensive and transparent evaluation. Because these criteria influence one another, CoT is used to address each criterion systematically while taking the others into account. To empirically validate the effectiveness of StepGrade, we conducted a case study involving 30 Python programming assignments across three difficulty levels (easy, intermediate, and advanced). The approach is validated against expert human evaluations to assess its consistency, accuracy, and fairness. Results demonstrate that CoT prompting significantly outperforms regular prompting in both grading quality and interpretability. By reducing the time and effort required for manual grading, this research demonstrates the potential of GPT-4 with CoT prompting to revolutionize programming education through scalable and pedagogically effective automated grading systems.
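
The stepwise evaluation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the function names `build_step_prompt` and `grade_stepwise`, and the `llm` callable are all hypothetical, and the actual prompts used by StepGrade are the ones shown in the paper's Figure 2. The sketch only shows the general idea of visiting each criterion in order and carrying earlier findings forward, so that later steps can take earlier ones into account.

```python
from typing import Callable, Dict, List

# Criteria in the order named in the abstract (assumed evaluation order).
CRITERIA: List[str] = ["functionality", "code quality", "algorithmic efficiency"]


def build_step_prompt(
    criterion: str, assignment: str, code: str, prior: Dict[str, str]
) -> str:
    """Build the prompt for one grading step (hypothetical wording).

    Findings from earlier steps are included so the model can consider
    how the criteria influence one another.
    """
    context = "\n".join(f"- {name}: {finding}" for name, finding in prior.items())
    return (
        f"Assignment:\n{assignment}\n\n"
        f"Student code:\n{code}\n\n"
        f"Findings from earlier steps:\n{context or '- (none yet)'}\n\n"
        f"Reason step by step about the {criterion} of this submission, "
        "then give feedback and a grade for this criterion."
    )


def grade_stepwise(
    assignment: str, code: str, llm: Callable[[str], str]
) -> Dict[str, str]:
    """Query the model once per criterion, in order, and collect the replies.

    `llm` is any function mapping a prompt string to a reply string; which
    model or client sits behind it is left open here.
    """
    findings: Dict[str, str] = {}
    for criterion in CRITERIA:
        prompt = build_step_prompt(criterion, assignment, code, findings)
        findings[criterion] = llm(prompt)
    return findings
```

Passing the model as a plain callable keeps the sketch independent of any particular client library; a regular-prompting baseline would correspond to a single call that asks for all three criteria at once with no intermediate findings.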

Paper Structure

This paper contains 20 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The Overall Grading Flow in StepGrade Framework: Sequential Evaluation of Key Grading Criteria, i.e., Functionality, Code Quality, and Efficiency using CoT Prompting, with Detailed Feedback and Grades Outputs.
  • Figure 2: The Detailed Prompts Utilized at Each Step of the CoT Prompting Process for Evaluating Functionality, Code Quality, and Algorithmic Efficiency.
  • Figure 3: An Example of Assignments and the Code Submitted by the Student.
  • Figure 4: The Example Feedback provided by the LLM using Regular Prompting and CoT Prompting (StepGrade) Approaches.