A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science

Clayton Cohn; Nicole Hutchins; Tuan Le; Gautam Biswas

TL;DR

This work investigates scalable, explainable scoring of open-ended K-12 science responses by combining Chain-of-Thought (CoT) prompting with GPT-4 and a human-in-the-loop Active Learning (AL) workflow within the SPICE Earth Science curriculum. The approach aligns automated scores with a standards-based rubric and generates student-facing explanations, achieving strong agreement with human scorers across multiple subscores (many with $\kappa$ of 0.8 or higher and some near 1.0). The findings highlight both the potential and the risks of CoT+AL, including overfitting and the need for careful rubric design and teacher collaboration to maximize classroom impact. The study points to practical pathways for deploying LLM-assisted formative assessment feedback while addressing ethical and reliability considerations in educational settings.

Abstract

This paper explores the use of large language models (LLMs) to score and explain short-answer assessments in K-12 science. While existing methods can score more structured math and computer science assessments, they often do not provide explanations for the scores. Our study focuses on employing GPT-4 for automated assessment in middle school Earth Science, combining few-shot and active learning with chain-of-thought reasoning. Using a human-in-the-loop approach, we successfully score and provide meaningful explanations for formative assessment responses. A systematic analysis of our method's pros and cons sheds light on the potential for human-in-the-loop techniques to enhance automated grading for open-ended science assessments.

Paper Structure (13 sections, 2 figures, 3 tables)

This paper contains 13 sections, 2 figures, 3 tables.

Introduction
Background
Methods
Curricular context
Study Design and Dataset
Model
Approach
Response Scoring
Prompt Development
Active Learning
Results
Comparing Model and Human Performance
Conclusion and Future Implications

Figures (2)

Figure 1: The fictitious student's conceptual model used by students to answer the assessment questions.
Figure 2: Our Chain-of-Thought Prompting + Active Learning approach. The green box encapsulates this process, where each of the blue diamonds is a step in that process. Yellow boxes represent the process's application to the classroom.
