Scalable Generation and Validation of Isomorphic Physics Problems with GenAI
Naiming Liu, Leo Murch, Spencer Moore, Tong Wan, Shashank Sonkar, Richard Baraniuk, Zhongzhou Chen
TL;DR
The paper addresses the scalability and fairness challenges of STEM assessments by proposing isomorphic physics problem banks generated with GenAI and validated with LM-based simulations. The authors introduce a prompt-chaining and tool-use framework to create large banks (ESTELA-Physics) with controlled structural variations and diverse contexts, alongside a pre-deployment LM validation pipeline using 17 open-source models. Empirical results show that $73\%$ of deployed banks are statistically homogeneous in difficulty and that LM performance correlates with student performance at up to $\rho = 0.594$; the LMs also flag problematic variants such as ambiguous problem texts. Model scale and architecture critically influence validation utility: mid-sized instruction- or reasoning-focused LMs provide the most useful proxies for detecting difficulty outliers, enabling scalable, low-cost quality assurance for large isomorphic banks. The work has practical implications for asynchronous assessment, real-time problem generation, and personalized practice at scale, with future directions including richer diagram generation, multi-agent automation, and ability-stratified LM simulations.
Abstract
Traditional synchronous STEM assessments face growing challenges including accessibility barriers, security concerns from resource-sharing platforms, and limited comparability across institutions. We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations (numeric values, spatial relations) alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models (LMs) ranging from 0.6B to 32B parameters and compare their results against actual student performance (N>200) across three midterm exams. Results show that 73% of deployed banks achieve statistically homogeneous difficulty, and that LM performance patterns correlate strongly with student performance (Pearson's $\rho$ up to 0.594). Additionally, LMs successfully identify problematic variants, such as ambiguous problem texts. Model scale also proves critical for effective validation: extremely small (<4B) and large (>14B) models exhibit floor and ceiling effects, respectively, making mid-sized models optimal for detecting difficulty outliers.
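To make the validation idea concrete, the following is a minimal sketch of the two quantities the abstract refers to: a Pearson correlation between per-variant LM accuracy and per-variant student accuracy, and a simple check for difficulty outliers within one bank. It is an illustration under stated assumptions, not the authors' pipeline. The accuracy values are made up, the function names `pearson` and `flag_outliers` are hypothetical, and the z-score rule with a 1.5 threshold is a stand-in for whatever homogeneity test the paper actually uses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def flag_outliers(accuracies, threshold=1.5):
    """Return indices of variants whose accuracy lies more than `threshold`
    sample standard deviations from the bank mean.

    ASSUMPTION: this z-score rule is an illustrative stand-in, not the
    statistical test used in the paper.
    """
    n = len(accuracies)
    if n < 2:
        return []
    mean = sum(accuracies) / n
    sd = sqrt(sum((a - mean) ** 2 for a in accuracies) / (n - 1))
    if sd == 0:
        return []  # all variants equally difficult, nothing to flag
    return [i for i, a in enumerate(accuracies) if abs(a - mean) / sd > threshold]


# HYPOTHETICAL data: fraction of correct answers per variant of one bank.
lm_accuracy = [0.80, 0.78, 0.82, 0.79, 0.81, 0.30]
student_accuracy = [0.70, 0.68, 0.74, 0.69, 0.72, 0.45]

if __name__ == "__main__":
    print("LM vs. student correlation:", round(pearson(lm_accuracy, student_accuracy), 3))
    print("Variants flagged by LM accuracy:", flag_outliers(lm_accuracy))
```

In this toy bank the last variant is much harder for the simulated LM than its siblings, so it would be flagged for review before deployment. The floor and ceiling effects mentioned in the abstract correspond to the case where all accuracies sit near 0 or near 1, which leaves little spread for such a check to work with.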
