Scalable Generation and Validation of Isomorphic Physics Problems with GenAI

Naiming Liu; Leo Murch; Spencer Moore; Tong Wan; Shashank Sonkar; Richard Baraniuk; Zhongzhou Chen

TL;DR

The paper addresses the scalability and fairness challenges of STEM assessments by proposing isomorphic physics problem banks generated with GenAI and validated with LM-based simulations. The authors introduce a prompt-chaining and tool-use framework to create large banks (ESTELA-Physics) with controlled structural variations and diverse contexts, alongside a pre-deployment LM validation pipeline using 17 open-source models. Empirical results show that about $73\%$ of banks are statistically homogeneous in difficulty, and LM performance correlates with student outcomes up to $\rho=0.594$, while effectively flagging problematic variants and text issues. Model scale and architecture critically influence validation utility, with mid-sized instruction- or reasoning-focused LMs providing the most useful proxies for detecting difficulty outliers, enabling scalable, low-cost quality assurance for large isomorphic banks. The work has practical implications for asynchronous assessment, real-time problem generation, and personalized practice at scale, with future directions including richer diagram generation, multi-agent automation, and ability-stratified LM simulations.
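To make "statistically homogeneous in difficulty" concrete, here is a minimal sketch of one way to check it: a Pearson chi-square test of equal success rates across the variants in a bank. The summary does not say which test the authors used, so the choice of test, the function name `chi_square_homogeneity`, and the counts below are illustrative assumptions only.

```python
# Sketch only: the choice of a chi-square homogeneity test is an assumption,
# not the authors' stated method. All counts are hypothetical.

def chi_square_homogeneity(correct, attempts):
    """Return (statistic, degrees of freedom) for equal success rates.

    correct[i]  = number of correct responses on variant i
    attempts[i] = number of responses on variant i
    Assumes the pooled success rate is strictly between 0 and 1.
    """
    pooled = sum(correct) / sum(attempts)
    stat = 0.0
    for c, n in zip(correct, attempts):
        expected_correct = n * pooled
        expected_wrong = n * (1 - pooled)
        stat += (c - expected_correct) ** 2 / expected_correct
        stat += ((n - c) - expected_wrong) ** 2 / expected_wrong
    return stat, len(correct) - 1

# Standard chi-square critical values at the 0.05 level, indexed by dof.
CRITICAL_0_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

# Hypothetical bank with four variants; the last one looks much harder.
correct = [30, 28, 31, 12]
attempts = [50, 50, 50, 50]

stat, dof = chi_square_homogeneity(correct, attempts)
homogeneous = stat <= CRITICAL_0_05[dof]
print(f"chi2 = {stat:.2f}, dof = {dof}, homogeneous = {homogeneous}")
```

Under this sketch, a bank counts as homogeneous when the statistic stays below the critical value; the hypothetical fourth variant pushes the example bank well past it.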

Abstract

Traditional synchronous STEM assessments face growing challenges including accessibility barriers, security concerns from resource-sharing platforms, and limited comparability across institutions. We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations (numeric values, spatial relations) alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models (LMs) (0.6B-32B) and compare against actual student performance (N>200) across three midterm exams. Results show that 73% of deployed banks achieve statistically homogeneous difficulty, and LM accuracy patterns correlate strongly with student performance (Pearson's $\rho$ up to 0.594). Additionally, LMs successfully identify problematic variants, such as ambiguous problem texts. Model scale also proves critical for effective validation: extremely small (<4B) and large (>14B) models exhibit floor and ceiling effects, respectively, making mid-sized models optimal for detecting difficulty outliers.
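The comparison between LM and student performance can be sketched as a Pearson correlation over per-variant accuracies. The snippet below is an illustration under assumptions: the accuracy values are made up, and the helper name `pearson` is hypothetical rather than taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical per-variant accuracies for one problem bank (not real data).
student_acc = [0.62, 0.58, 0.65, 0.31]
lm_acc = [0.70, 0.66, 0.74, 0.40]

r = pearson(student_acc, lm_acc)
print(f"Pearson correlation: {r:.3f}")
```

A model whose accuracy sits near 0 or 1 on every variant (the floor and ceiling effects described above) has almost no spread, so its correlation with student data carries little information about difficulty outliers.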

Paper Structure (22 sections, 1 figure, 3 tables)

Introduction
Contributions:
Related Works
Automated Question Generation
Estimating Question Difficulty using LLMs
Isomorphic Problem Generation
AI-assisted Isomorphic Problem Generation
Problem Generation Framework
Example Problem Bank Creation: Angled Force with Friction
ESTELA-Physics Dataset
Validation of Isomorphic Problem Banks
LM-based Isomorphic Problem Validation
Collection of Actual Student Performance Data
Evaluation
Results and Discussions
...and 7 more sections

Figures (1)

Figure 1: Comparison of the student accuracy distribution (top) and LM-simulated accuracy (bottom) for problem bank 6-1. Error bars represent standard error of measurement.
