Testing GPT-4 with Wolfram Alpha and Code Interpreter plug-ins on math and science problems

Ernest Davis; Scott Aaronson

TL;DR

This study evaluates GPT-4 augmented with Wolfram Alpha and Code Interpreter plug-ins on 105 original high school and college level science and math problems, organized into three test sets to probe numerical, calculation-free, and motivated numerical tasks. The authors report that plug-ins substantially improve performance over GPT-4 alone, but observe pervasive interface failures and a gap between plug-in capability and effective use, suggesting the need for interactive, iterative problem solving with human oversight. Quantitatively, the results show mixed success across sets, with $8.25/32$ and $10/32$ on Arbitrary Numerical, $30.7/53$ and $34.2/53$ on Calculation-Free, and $14.3/20$ and $13.8/20$ on Motivated Numerical for WA and CI, respectively. The work argues for improved plug-in interfaces and cautions that while these systems approach undergraduate-level competence on some tasks, they are not yet reliable enough for autonomous college-level calculation workloads.
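The scores above are easier to compare as percentages. The following is a minimal sketch that converts the reported figures; the dictionary layout and function names are illustrative assumptions, and the pooled "Overall" row is a derived aggregate computed here, not a number reported in the summary above.

```python
# Scores copied from the TL;DR: points earned with each plug-in and the
# number of problems in each test set. WA = Wolfram Alpha, CI = Code Interpreter.
SCORES = {
    "Arbitrary Numerical": {"WA": 8.25, "CI": 10.0, "n": 32},
    "Calculation-Free": {"WA": 30.7, "CI": 34.2, "n": 53},
    "Motivated Numerical": {"WA": 14.3, "CI": 13.8, "n": 20},
}


def percent(points, total):
    """Return points/total as a percentage."""
    return 100.0 * points / total


def summarize(scores):
    """Per-set percentages plus a pooled row (the pooled row is our own aggregate)."""
    rows = {
        name: {plugin: percent(s[plugin], s["n"]) for plugin in ("WA", "CI")}
        for name, s in scores.items()
    }
    n_total = sum(s["n"] for s in scores.values())
    rows["Overall"] = {
        plugin: percent(sum(s[plugin] for s in scores.values()), n_total)
        for plugin in ("WA", "CI")
    }
    return rows, n_total


if __name__ == "__main__":
    rows, n_total = summarize(SCORES)
    print(f"total problems: {n_total}")
    for name, row in rows.items():
        print(f"{name:<22} WA {row['WA']:5.1f}%   CI {row['CI']:5.1f}%")
```

The three set sizes sum to 105, which matches the problem count stated in the TL;DR and abstract.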

Abstract

This report describes a test of the large language model GPT-4 with the Wolfram Alpha and the Code Interpreter plug-ins on 105 original problems in science and math, at the high school and college levels, carried out in June-August 2023. Our tests suggest that the plug-ins significantly enhance GPT's ability to solve these problems. Having said that, there are still often "interface" failures; that is, GPT often has trouble formulating problems in a way that elicits useful answers from the plug-ins. Fixing these interface failures seems like a central challenge in making GPT a reliable tool for college-level calculation problems.

Paper Structure (13 sections, 1 equation, 7 tables)

This paper contains 13 sections, 1 equation, 7 tables.

Summary of Conclusions
The test sets: Overview
The "Arbitrary Numerical" test set: Overview
The "Calculation-Free" test set: Overview
The "Motivated Numerical" test set: Overview
History of the AI systems and the testing project
The Design and Testing Process
Results
Related Work
SciBench
Other work
Discussion
Methodological observations
