A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science
Clayton Cohn, Nicole Hutchins, Tuan Le, Gautam Biswas
TL;DR
This work investigates scalable, explainable scoring of open-ended K-12 science responses by combining Chain-of-Thought (CoT) prompting with GPT-4 and a human-in-the-loop Active Learning (AL) workflow within the SPICE Earth Science curriculum. The approach aligns automated scores with a standards-based rubric and generates student-facing explanations, achieving strong agreement with human scorers across multiple subscores (many with $\kappa$ of 0.8 or higher and some near 1.0). The findings highlight both the potential and the risks of CoT+AL, including overfitting and the need for careful rubric design and teacher collaboration to maximize classroom impact. The study points to practical pathways for deploying LLM-assisted formative assessment feedback while addressing ethical and reliability considerations in educational settings.
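As a rough illustration of the agreement statistic mentioned above, the sketch below computes unweighted Cohen's $\kappa$ between human and model scores. It assumes the reported $\kappa$ is Cohen's kappa; the summary does not say which variant (unweighted or weighted) was used, and the function name and the score lists are hypothetical, not data from the study.

```python
from collections import Counter


def cohen_kappa(human_scores, model_scores):
    """Unweighted Cohen's kappa for two equally long lists of labels.

    Assumption: this is only one possible reading of the kappa in the text.
    """
    if len(human_scores) != len(model_scores) or not human_scores:
        raise ValueError("score lists must be non-empty and of equal length")

    n = len(human_scores)
    # Observed agreement: fraction of responses given the same score.
    observed = sum(h == m for h, m in zip(human_scores, model_scores)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    human_counts = Counter(human_scores)
    model_counts = Counter(model_scores)
    expected = sum(
        human_counts[label] * model_counts[label] for label in human_counts
    ) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical subscores (0 = rubric item absent, 1 = present).
human = [0, 1, 1, 0, 1, 0, 1, 1]
model = [0, 1, 1, 0, 1, 0, 0, 1]
kappa = cohen_kappa(human, model)
```

Here the raters agree on 7 of 8 responses (0.875) against a chance agreement of 0.5, so $\kappa = 0.75$, which shows how raw agreement overstates reliability when chance agreement is high.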
Abstract
This paper explores the use of large language models (LLMs) to score and explain short-answer assessments in K-12 science. While existing methods can score more structured math and computer science assessments, they often do not provide explanations for the scores. Our study focuses on employing GPT-4 for automated assessment in middle school Earth Science, combining few-shot and active learning with chain-of-thought reasoning. Using a human-in-the-loop approach, we successfully score and provide meaningful explanations for formative assessment responses. A systematic analysis of our method's pros and cons sheds light on the potential for human-in-the-loop techniques to enhance automated grading for open-ended science assessments.
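The combination of few-shot examples, chain-of-thought reasoning, and human-in-the-loop refinement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LabeledExample` type, the `build_prompt` and `add_correction` helpers, the prompt wording, and the rubric text are all hypothetical, and the call to the model itself is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledExample:
    """A human-scored response with a written rationale (hypothetical type)."""
    response: str
    reasoning: str  # chain-of-thought explanation tied to the rubric
    score: int


def build_prompt(rubric: str, examples: List[LabeledExample], student_response: str) -> str:
    """Assemble a few-shot chain-of-thought scoring prompt as plain text."""
    parts = [
        "Score the student response using the rubric.",
        "Explain your reasoning step by step, then give the score.",
        f"Rubric:\n{rubric}",
    ]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\n"
            f"Response: {ex.response}\n"
            f"Reasoning: {ex.reasoning}\n"
            f"Score: {ex.score}"
        )
    # The new response ends with an open "Reasoning:" cue so the model
    # produces its explanation before committing to a score.
    parts.append(f"Response: {student_response}\nReasoning:")
    return "\n\n".join(parts)


def add_correction(examples: List[LabeledExample], corrected: LabeledExample) -> List[LabeledExample]:
    """Human-in-the-loop step: a reviewer adds a corrected example for a
    response the model scored incorrectly (a simplified stand-in for the
    active learning loop)."""
    return examples + [corrected]


# Hypothetical rubric and examples, for illustration only.
rubric = "1 point if the response states that water runs off impervious surfaces."
pool = [
    LabeledExample(
        response="Rain can't soak into concrete so it flows away.",
        reasoning="The student says water does not absorb into concrete and flows away, which matches runoff.",
        score=1,
    )
]
pool = add_correction(
    pool,
    LabeledExample(
        response="Concrete is hard.",
        reasoning="The student describes the surface but does not mention what happens to the water.",
        score=0,
    ),
)
prompt = build_prompt(rubric, pool, "The water goes into the grass.")
```

Placing the reasoning before the score in each example is a design choice that encourages the model to justify its answer first; the resulting explanation is also what a student or teacher would read.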
