A systematic comparison of Large Language Models for automated assignment assessment in programming education: Exploring the importance of architecture and vendor

Marcin Jukiewicz

TL;DR

The study addresses the variability of automated programming-assignment grading across diverse LLM architectures and vendors. It applies a large-scale, side-by-side evaluation of $18$ models on $6081$ task records using Chain-of-Thought prompting and standard statistical metrics, including $ICC(2,1)$, Spearman correlations, and clustering. Key findings show systematic differences in grade distributions and means across models, six distinct evaluation-style clusters, and only moderate alignment with human teachers—even as model consensus remains high. The work highlights the non-neutral nature of model choice for education and calls for pedagogy-aligned deployment with sustained human oversight and re-evaluation as models evolve.

Abstract

This study presents the first large-scale, side-by-side comparison of contemporary Large Language Models (LLMs) in the automated grading of programming assignments. Drawing on over 6,000 student submissions collected across four years of an introductory programming course, we systematically analysed the distribution of grades, differences in mean scores and variability reflecting stricter or more lenient grading, and the consistency and clustering of grading patterns across models. Eighteen publicly available models were evaluated: Anthropic (claude-3-5-haiku, claude-opus-4-1, claude-sonnet-4); DeepSeek (deepseek-chat, deepseek-reasoner); Google (gemini-2.0-flash-lite, gemini-2.0-flash, gemini-2.5-flash-lite, gemini-2.5-flash, gemini-2.5-pro); and OpenAI (gpt-4.1-mini, gpt-4.1-nano, gpt-4.1, gpt-4o-mini, gpt-4o, gpt-5-mini, gpt-5-nano, gpt-5). Statistical tests, correlation and clustering analyses revealed clear, systematic differences between and within vendor families, with "mini" and "nano" variants consistently underperforming their full-scale counterparts. All models displayed high agreement with the model consensus, as measured by the intraclass correlation coefficient, but only moderate agreement with human teachers' grades, indicating a persistent gap between automated and human assessment. These findings underscore that the choice of model for educational deployment is not neutral and should be guided by pedagogical goals, transparent reporting of evaluation metrics, and ongoing human oversight to ensure accuracy, fairness and relevance.
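The agreement statistic named in the summary, $ICC(2,1)$, is the two-way random-effects, absolute-agreement, single-rater intraclass correlation. The sketch below shows one common way to compute it from a subjects-by-raters table using the standard mean-squares formula. It is an illustration only, not the paper's pipeline: the function name `icc2_1` and the toy ratings are assumptions, and the sample table is the classic Shrout and Fleiss textbook example rather than data from the study.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per graded item (subject),
    one column per rater (e.g. a model and a reference grade).
    """
    n = len(ratings)          # number of subjects
    k = len(ratings[0])       # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for the two-way ANOVA decomposition.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical toy input: 6 items scored by 4 raters.
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(toy), 2))
```

Because $ICC(2,1)$ measures absolute agreement, a rater that is systematically stricter or more lenient lowers the coefficient even when it ranks submissions identically to the others. This makes it a natural complement to rank-based measures such as Spearman correlation.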

Paper Structure

This paper contains 11 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Relative frequency of grades assigned by different LLMs.
  • Figure 2: Spearman correlation heatmap (clipped to 0–1) for all LLM models.
  • Figure 3: Cohen’s $\kappa$ agreement heatmap between LLMs (values clipped to $[0,1]$).
  • Figure 4: Silhouette score for different numbers of clusters ($k$) in the $k$-means analysis of grade distributions.
  • Figure 5: Hierarchical clustering dendrogram based on Spearman correlations between models' grading patterns.
  • ...and 1 more figure
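Figures 2 and 3 summarise pairwise agreement between models with Spearman correlation and Cohen's $\kappa$. As a rough illustration of what each cell in such a heatmap represents, the sketch below computes both statistics for two grade vectors using only the Python standard library. The helper names, the grade scale and the two grade lists are hypothetical, and the paper's own tooling is not specified here.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    n = len(a)
    observed = sum(1 for p, q in zip(a, b) if p == q) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal grade frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical grades from two models on eight submissions.
model_a = [0, 1, 2, 2, 1, 0, 2, 1]
model_b = [0, 1, 2, 1, 1, 0, 2, 2]
rho = spearman(model_a, model_b)
kappa = cohen_kappa(model_a, model_b)
# Clip to [0, 1] as the heatmap captions describe.
clipped = [max(0.0, min(1.0, v)) for v in (rho, kappa)]
```

Spearman correlation only compares how two models order the submissions, whereas $\kappa$ counts exact grade matches corrected for chance, so two models can correlate strongly while still agreeing on relatively few exact grades.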