Concept-based Rubrics Improve LLM Formative Assessment and Data Synthesis

Yuchen Wei, Dennis Pearl, Matthew Beckman, Rebecca J. Passonneau

TL;DR

The paper tackles automated formative assessment in STEM by showing that concept-based, question-specific rubrics substantially improve LLM performance, narrowing the gap with fully supervised pretrained language models (PLMs). It compares supervised PLMs with in-context LLM approaches, demonstrates substantial in-context gains for LLMs when using granular rubrics, and shows how LLM-generated data can train lightweight PLMs to match or approach LLM performance. It also investigates explanatory feedback from LLMs and introduces data-synthesis strategies, including diversity-enhancing relabeling, to distill LLM capabilities into efficient classifiers. The findings support a scalable, hybrid pipeline in which LLMs handle rubric-driven assessment and feedback, while distillation and synthetic data reduce labeling costs for practical deployment across diverse STEM domains.

Abstract

Formative assessment in STEM topics aims to promote student learning by identifying students' current understanding, thus targeting how to promote further learning. Previous studies suggest that the assessment performance of current generative large language models (LLMs) on constructed responses to open-ended questions is significantly lower than that of supervised classifiers trained on high-quality labeled data. However, we demonstrate that concept-based rubrics can significantly enhance LLM performance, which narrows the gap between LLMs as off-the-shelf assessment tools and smaller supervised models, which need large amounts of training data. For datasets where concept-based rubrics allow LLMs to achieve strong performance, we show that the concept-based rubrics help the same LLMs generate high-quality synthetic data for training lightweight, high-performance supervised models. Our experiments span diverse STEM student response datasets with labels of varying quality, including a new real-world dataset that contains some AI-assisted responses, which introduces additional considerations.

Paper Structure

This paper contains 12 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Prompt template for evaluating GPT-4o-mini performance.
  • Figure 2: GPT-4o-mini performance across different datasets, incorporating different numbers of examples and the question-specific rubric. Note that the number of examples is per label per question; 5 examples per label corresponds to $3 \times 5 = 15$ examples per prompt.
  • Figure 3: Sample questions and rubrics for ASAP, ISTUDIO, and CLASSIFIES.
  • Figure 4: Sensitivity of PLM accuracy to varying amounts of DiversityEnhancing synthesized samples on CLASSIFIES and ISTUDIO.
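
The example count in the Figure 2 caption (examples per label times number of labels) can be illustrated with a minimal sketch. This is not the paper's prompt template from Figure 1: the function name `build_prompt`, the prompt layout, and the label names used in the usage note are hypothetical, chosen only to show how a question-specific rubric and a fixed number of examples per label might be combined into one prompt.

```python
from collections import defaultdict


def build_prompt(question, rubric, labeled_examples, k_per_label, response):
    """Assemble a rubric-plus-examples prompt (illustrative layout only).

    labeled_examples is a list of (response_text, label) pairs.
    Returns the prompt string and the number of in-context examples used.
    """
    # Group the available examples by label.
    by_label = defaultdict(list)
    for text, label in labeled_examples:
        by_label[label].append(text)

    # Take at most k_per_label examples from each label.
    shots = []
    for label in sorted(by_label):
        for text in by_label[label][:k_per_label]:
            shots.append((text, label))

    # Hypothetical layout: question, rubric, examples, then the new response.
    lines = [f"Question: {question}", f"Rubric: {rubric}", ""]
    for text, label in shots:
        lines.append(f"Response: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Response: {response}")
    lines.append("Label:")
    return "\n".join(lines), len(shots)
```

With three labels (say, hypothetical names "correct", "partial", and "incorrect") and `k_per_label=5`, the function selects 3 × 5 = 15 in-context examples, matching the arithmetic in the caption, provided at least five examples exist for each label.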