A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science

Clayton Cohn, Nicole Hutchins, Tuan Le, Gautam Biswas

TL;DR

This work investigates scalable, explainable scoring of open-ended K-12 science responses by combining Chain-of-Thought prompting with GPT-4 and a human-in-the-loop Active Learning workflow within the SPICE Earth Science curriculum. The approach aligns automated scores with a standards-based rubric and generates student-facing explanations, achieving strong agreement with human scorers across multiple subscores (many with $\kappa$ in the 0.8+ range and some near 1.0). The findings highlight both the potential and risks of CoT+AL, including overfitting and the need for careful rubric design and teacher collaboration to maximize classroom impact. The study points to practical pathways for deploying LLM-assisted formative assessment feedback while addressing ethical and reliability considerations in educational settings.
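
As a rough illustration of what the agreement figures above measure, the sketch below computes an unweighted Cohen's $\kappa$ between a human scorer and a model on one binary subscore. It assumes that $\kappa$ here refers to Cohen's kappa; the study may use a weighted variant. The function name `cohen_kappa` and the score lists are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length label sequences")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up scores for ten responses on one binary subscore (illustration only).
human = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
model = [0, 1, 1, 0, 1, 0, 1, 0, 0, 0]
print(round(cohen_kappa(human, model), 3))
```

In this toy data the two scorers agree on nine of ten responses, yet $\kappa$ is lower than the raw agreement rate because it discounts the agreement expected by chance.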

Abstract

This paper explores the use of large language models (LLMs) to score and explain short-answer assessments in K-12 science. While existing methods can score more structured math and computer science assessments, they often do not provide explanations for the scores. Our study focuses on employing GPT-4 for automated assessment in middle school Earth Science, combining few-shot and active learning with chain-of-thought reasoning. Using a human-in-the-loop approach, we successfully score and provide meaningful explanations for formative assessment responses. A systematic analysis of our method's pros and cons sheds light on the potential for human-in-the-loop techniques to enhance automated grading for open-ended science assessments.

Paper Structure (13 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: The fictitious student's conceptual model used by students to answer the assessment questions.
  • Figure 2: Our Chain-of-Thought Prompting + Active Learning approach. The green box encapsulates this process, where each of the blue diamonds is a step in that process. Yellow boxes represent the process's application to the classroom.