Could an Artificial-Intelligence agent pass an introductory physics course?

Gerd Kortemeyer

TL;DR

The paper evaluates whether a state-of-the-art AI language model (ChatGPT, January 2023) can pass a calculus-based introductory physics course by having it work through representative assessments (the Force Concept Inventory (FCI), homework, clicker questions, programming exercises, and exams) and grading its responses like those of a human student. It finds that ChatGPT would narrowly pass the course while exhibiting persistent novice misconceptions and arithmetic errors; it performs exceptionally well only on programming tasks and underperforms on conceptual and numerical reasoning. The results raise questions for physics education about academic integrity, assessment design, and the skills students must develop to work with AI. The study suggests focusing on metacognition, conceptual understanding, and computation-enabled curricula to prepare learners for AI-enabled environments.

Abstract

Massive pre-trained language models have garnered attention and controversy due to their ability to generate human-like responses: attention due to their frequent indistinguishability from human-generated phraseology and narratives, and controversy due to the fact that their convincingly presented arguments and facts are frequently simply false. Just how human-like are these responses when it comes to dialogues about physics, in particular about the standard content of introductory physics courses? This study explores that question by having ChatGPT, the pre-eminent language model in 2023, work through representative assessment content of an actual calculus-based physics course and grading the responses in the same way human responses would be graded. As it turns out, ChatGPT would narrowly pass this course while exhibiting many of the preconceptions and errors of a beginning learner.

Paper Structure

This paper contains 12 sections and 9 figures.

Figures (9)

  • Figure 1: A sample ChatGPT dialogue about a homework problem. The entries labelled with a red "KO" are by the author; the entries labelled in green are by ChatGPT.
  • Figure 2: Text-based transcription of a graphical problem. The left panel shows the online version of a final exam problem in LON-CAPA (the graph would be parametrically randomized), the right panel the transcription for ChatGPT, as well as the ensuing dialogue.
  • Figure 3: Surface-feature modification of a Force Concept Inventory problem. The left panel shows the original problem, the right panel a modification.
  • Figure 4: Logical error in an attempt to solve the transcribed question 19 of the Force Concept Inventory.
  • Figure 5: A late-night dialogue between a "stubbornly guessing" ChatGPT and a frustrated author.
  • ...and 4 more figures