When AI Evaluates Its Own Work: Validating Learner-Initiated, AI-Generated Physics Practice Problems

Tobias Geisler, Gerd Kortemeyer

TL;DR

The paper tackles the challenge of validating AI-generated, on-demand physics practice problems by benchmarking automated quality checks against expert judgments and learner preferences in a chatbot-driven workflow. Using 34 students and 543 generated problems, the authors benchmark commodity LLMs as judges, model the link between metrics and student choice with random forests, and triangulate findings with exit surveys. They identify a compact, three-tier metric stack (notably includes-solution-strategy, task-specific-and-complete, and measurement-unit-is-clearly-stated) that achieves comparable predictive power to a full metric battery while reducing cost and latency, enabling scalable real-time formative assessment in physics. The results offer a practical blueprint for deploying AI-generated practice across quantitative disciplines, while acknowledging limitations such as dataset size, single-expert ground truth, and the need for multimodal scoring and longitudinal validation.

Abstract

Large language models (LLMs) can now generate physics practice problems in real time, yet the educational value of these items hinges on rapid, reliable post-generation vetting. In this exploratory study, we investigated which automated checks are both technically feasible and pedagogically meaningful when exercises are produced on demand within a chatbot interface. A cohort of 34 introductory-physics students generated and attempted 543 practice problems during exam preparation. Each item was labeled by an expert on a wide range of quality attributes and presented to the learners in pairs to record their preference. We then (i) benchmarked three commodity LLMs as "judges" against the expert labels, (ii) quantified which attributes predict student choice via random-forest models, and (iii) triangulated these results with free-form exit surveys. Only a small subset of the original metric items proved necessary to reliably address student preferences either directly or by proxy. The study demonstrates that scalable formative assessment does not require exhaustive scoring: a carefully curated core of structural and learner-visible checks is sufficient to ensure both technical soundness and user appeal. The findings provide a practical blueprint for deploying real-time, AI-generated practice in physics and other quantitative disciplines.
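
Step (i), benchmarking an LLM judge against expert labels, can be illustrated with a minimal sketch. The abstract does not say which agreement statistic the authors used, so Cohen's kappa is an assumption here, chosen because it is a common chance-corrected measure for two raters. The label lists, the judge names `judge_a` and `judge_b`, and the helper names are hypothetical and exist only for illustration; only the metric name `includes-solution-strategy` comes from the summary above.

```python
from collections import Counter


def raw_agreement(expert, judge):
    """Fraction of items on which the judge matches the expert label."""
    if len(expert) != len(judge) or not expert:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(e == j for e, j in zip(expert, judge)) / len(expert)


def cohen_kappa(expert, judge):
    """Chance-corrected agreement (Cohen's kappa) between two label lists."""
    n = len(expert)
    observed = raw_agreement(expert, judge)
    expert_freq = Counter(expert)
    judge_freq = Counter(judge)
    # Agreement expected by chance if both raters labeled independently
    # with their own observed label frequencies.
    expected = sum(expert_freq[c] * judge_freq[c] for c in expert_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label on every item
    return (observed - expected) / (1 - expected)


# Hypothetical binary labels (1 = attribute present) for one metric,
# "includes-solution-strategy", on eight made-up items. Not data from the paper.
expert_labels = [1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = {
    "judge_a": [1, 1, 0, 1, 0, 1, 1, 1],
    "judge_b": [1, 0, 0, 1, 1, 0, 1, 0],
}

for name, labels in judge_labels.items():
    agreement = raw_agreement(expert_labels, labels)
    kappa = cohen_kappa(expert_labels, labels)
    print(f"{name}: agreement={agreement:.3f}, kappa={kappa:.3f}")
```

Reporting kappa alongside raw agreement matters when an attribute is present in most items: a judge that always answers "present" can reach high raw agreement while its kappa stays near zero.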

Paper Structure

This paper contains 29 sections, 1 equation, 7 figures, 8 tables.

Figures (7)

  • Figure 1: An example of an on-demand problem generated by GPT-5.2 Thinking.
  • Figure 2: Two interactive problems generated on-demand, showcasing how course-specific contextual information, in this case the premise of the problem "with the outer wall of the apartment building," is incorporated via Retrieval Augmented Generation (RAG) [Lewis et al., 2020]. As opposed to traditional chatbot output as in Fig. 1, the problems are rendered with an interactive answer field.
  • Figure 3: Attempting to solve the selected problem from Fig. 2.
  • Figure 4: Information flow in an enhanced chatbot that can generate verified practice problems on-the-fly and on-demand. The "problem validation" step (LLM-as-a-judge; green box) is the subject of our study.
  • Figure 5: Example of the definition for contains-misleading-extra-info in our LLM-as-a-judge pipeline (green box in Fig. 4).
  • ...and 2 more figures