The Boiling-Frog Problem of Physics Education

Gerd Kortemeyer

TL;DR

The paper investigates how GPT-5 Thinking performs on first-year physics tasks, to gauge where it falls on the trajectory from novice to expert and what that implies for instruction. It uses a multi-faceted evaluation across plug-and-chug problems, representation translation (TUG-K v4.0), card sorting based on solution structure, final exams, and epistemology (CLASS). The findings are symbolic problem solving with verification of results, near-expert performance in translating between graph representations, and expert-like attitudes. Key contributions include a demonstration that modern AI can avoid common novice pitfalls, a demonstration that its problem categorization aligns with deep solution structures, and a concrete set of recommendations for replacing credit-bearing unsupervised online assessments with process-based, transparent, and authentic tasks. The study argues for a decisive pedagogical jump: emphasize modeling, data, authentic labs, disclosed AI use, and collaborative problem solving, while focusing on what AI cannot substitute for: humans' ability to model the world, argue from evidence, and make principled approximations. The work thus informs curriculum design, assessment reform, and AI-integrated teaching strategies in introductory physics.

Abstract

It is astonishing how rapidly general-purpose AI has crossed familiar thresholds in introductory physics. Comparing outputs from successive models, GPT-5 Thinking moves far beyond the plug-and-chug tendencies seen earlier: on a classic elevator problem it works symbolically, notes when variables cancel, and verifies results; attempts to prompt novice-like behavior mainly affect tone, not method. On representation translation, the model scores 24/26 (92.3%) on TUG-K v4.0. In a card-sorting proxy using two of my comprehensive finals (60 items), its categories reflect solution method rather than surface features. Solving those same exams, it attains 27/30 and 25/30, with most misses in ruler-based ray tracing and circuit interpretation. On epistemology, five independent CLASS runs yield 100% favorable responses, indicating a simulated expert-like stance. Framed as a "boiling frog" problem, the paper argues for a decisive jump: retire credit-bearing unsupervised closed-response online assessments; grade process evidence; use paper and whiteboarding; shift weight to modeling, data, and authentic labs; require transparent, citable AI use; rebuild problem types; and lean on research-based instruction and peer learning. The opportunity is to foreground what AI cannot substitute for: modeling the world, arguing from evidence, and making principled approximations.

Paper Structure

This paper contains 8 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: The GPT-5-Thinking solution to the elevator problem.