Concept-based Rubrics Improve LLM Formative Assessment and Data Synthesis

Yuchen Wei; Dennis Pearl; Matthew Beckman; Rebecca J. Passonneau

TL;DR

The paper tackles automated formative assessment in STEM by showing that concept-based, question-specific rubrics substantially improve LLM assessment performance, narrowing the gap with fully supervised PLMs. It compares supervised PLMs with in-context LLM approaches, demonstrates substantial in-context gains for LLMs given granular rubrics, and shows how LLM-generated data can train lightweight PLMs that match or approach LLM performance. It also investigates explanatory feedback from LLMs and introduces data-synthesis strategies, including diversity-enhancing relabeling, to distill LLM capabilities into efficient classifiers. The findings support a scalable, hybrid pipeline in which LLMs handle rubric-driven assessment and feedback, while distillation and synthetic data reduce labeling costs for practical deployment across diverse STEM domains.

Abstract

Formative assessment in STEM topics aims to promote student learning by identifying students' current understanding and using it to target further learning. Previous studies suggest that the assessment performance of current generative large language models (LLMs) on constructed responses to open-ended questions is significantly lower than that of supervised classifiers trained on high-quality labeled data. However, we demonstrate that concept-based rubrics can significantly enhance LLM performance, which narrows the gap between LLMs as off-the-shelf assessment tools and smaller supervised models, which need large amounts of training data. For datasets where concept-based rubrics allow LLMs to achieve strong performance, we show that the concept-based rubrics help the same LLMs generate high-quality synthetic data for training lightweight, high-performance supervised models. Our experiments span diverse STEM student response datasets with labels of varying quality, including a new real-world dataset that contains some AI-assisted responses, which introduces additional considerations.
