Using AI Large Language Models for Grading in Education: A Hands-On Test for Physics

Ryan Mok, Faraaz Akhtar, Louis Clare, Christine Li, Jun Ida, Lewis Ross, Mario Campanelli

TL;DR

The paper tackles the challenge of automating the grading of undergraduate physics solutions using large language models (LLMs) and assesses how well AI grading compares to human grading across Classical Mechanics, Electromagnetic Theory, and Quantum Mechanics. It introduces a five-stage empirical workflow, generating three solutions per problem and evaluating two grading regimes (blind vs. mark-scheme) across four LLMs (Gemini 1.5 Pro, GPT-4, GPT-4o, Claude 3.5 Sonnet) with human graders as a baseline, and analyzes 30 problems (10 per topic). The results show that AI grading is prone to mathematical errors and hallucinations, making it generally less reliable than humans in blind grading, but the presence of a mark scheme markedly improves accuracy and consistency, with GPT-4 approaching human performance ($r$ up to about $0.80$ under mark scheme grading). A key finding is the correlation between an AI model's problem-solving ability and its grading capability, and unsupervised clustering reveals topic-dependent patterns, notably separations for Classical Mechanics. Overall, the study offers a replicable methodology for evaluating AI grading in STEM and highlights the current limitations and pathways for enhancement (e.g., improved prompting, API access, broader topic coverage).
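
As a rough illustration of the agreement figure quoted above, the sketch below computes a correlation coefficient between human and AI marks. It assumes that $r$ denotes the Pearson correlation coefficient between the two sets of marks; the function name `pearson_r` and the mark lists are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of marks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 entries")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one set of marks is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical marks for six solutions (illustrative only, not data from the paper)
human_marks = [10, 7, 12, 4, 9, 14]
ai_marks = [9, 8, 11, 6, 9, 13]

r = pearson_r(human_marks, ai_marks)
print(f"r = {r:.2f}")
```

A value of $r$ close to 1 would indicate that the AI ranks and spaces solutions much as the human graders do. It does not by itself show that the absolute marks agree, since a constant offset between the two graders leaves $r$ unchanged.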

Abstract

Grading assessments is time-consuming and prone to human bias. Students may experience delays in receiving feedback, and that feedback may not be tailored to their expectations or needs. Harnessing AI in education can be effective for grading undergraduate physics problems, enhancing the efficiency of undergraduate-level physics learning and teaching, and helping students understand concepts with the help of a constantly available tutor. This report devises a simple empirical procedure to investigate and quantify how well large language model (LLM) based AI chatbots can grade solutions to undergraduate physics problems in Classical Mechanics, Electromagnetic Theory and Quantum Mechanics, comparing human grading against AI grading. The following LLMs were tested: Gemini 1.5 Pro, GPT-4, GPT-4o and Claude 3.5 Sonnet. The results show that AI grading is prone to mathematical errors and hallucinations, which render it less effective than human grading. When the AI is given a mark scheme, however, grading quality improves substantially and approaches the level of human performance, which is promising for future AI implementation. Evidence indicates that the grading ability of an LLM is correlated with its problem-solving ability. Unsupervised clustering shows that Classical Mechanics problems may be graded differently from other topics. The method developed can be applied to investigate AI grading performance in other STEM fields.

Paper Structure

This paper contains 22 sections, 4 equations, 22 figures.

Figures (22)

  • Figure 1: Diagram example of a constructed automated grading system using an AI chatbot.
  • Figure 2: (Left) An example of an EM physics problem which has a corresponding figure. (Right) The same question is written in the LaTeX encoded form for prompt input.
  • Figure 3: (Top) An example QM word-based problem which requires explanation rather than heavy mathematics to answer. (Bottom) The same question is written in the LaTeX encoded form for prompt input.
  • Figure 4: A standard problem on Lorentz transformations with no corresponding figure that requires mathematical calculation to solve.
  • Figure 5: A small snippet of the mark scheme corresponding to the physics problem illustrated in Fig. 2. It is allocated 14 marks in total. Its LaTeX format is also shown, which is used within prompts.
  • ...and 17 more figures