Testing GPT-4 with Wolfram Alpha and Code Interpreter plug-ins on math and science problems
Ernest Davis, Scott Aaronson
TL;DR
This study evaluates GPT-4 augmented with the Wolfram Alpha (WA) and Code Interpreter (CI) plug-ins on 105 original high-school and college-level science and math problems, organized into three test sets: Arbitrary Numerical, Calculation-Free, and Motivated Numerical. The authors report that the plug-ins substantially improve performance over GPT-4 alone, but they observe pervasive interface failures and a gap between what the plug-ins can do and how effectively GPT-4 uses them, suggesting the need for interactive, iterative problem solving with human oversight. Quantitatively, results vary by test set: WA and CI score $8.25/32$ and $10/32$ on Arbitrary Numerical, $30.7/53$ and $34.2/53$ on Calculation-Free, and $14.3/20$ and $13.8/20$ on Motivated Numerical, respectively. The work argues for improved plug-in interfaces and cautions that while these systems approach undergraduate-level competence on some tasks, they are not yet reliable enough for autonomous college-level calculation workloads.
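To make the quoted scores easier to compare, the following minimal sketch converts each fraction to a percentage and sums the three sets. It uses only the six numbers given above; the variable names and the `percent` helper are illustrative, and the pooled figures are simple arithmetic over those numbers, not results reported by the authors.

```python
# Scores quoted in the summary: (WA score, CI score, number of problems).
# WA = Wolfram Alpha plug-in, CI = Code Interpreter plug-in.
scores = {
    "Arbitrary Numerical": (8.25, 10.0, 32),
    "Calculation-Free": (30.7, 34.2, 53),
    "Motivated Numerical": (14.3, 13.8, 20),
}


def percent(score, total):
    """Return score/total as a percentage (hypothetical helper)."""
    return 100.0 * score / total


# Per-set percentages for each plug-in.
per_set = {
    name: (percent(wa, n), percent(ci, n))
    for name, (wa, ci, n) in scores.items()
}

# Pooled totals across the three sets (derived here, not quoted from the report).
total_problems = sum(n for _, _, n in scores.values())
wa_total = sum(wa for wa, _, _ in scores.values())
ci_total = sum(ci for _, ci, _ in scores.values())
overall = (percent(wa_total, total_problems), percent(ci_total, total_problems))

for name, (wa_pct, ci_pct) in per_set.items():
    print(f"{name}: WA {wa_pct:.1f}%, CI {ci_pct:.1f}%")
print(f"All {total_problems} problems: WA {overall[0]:.1f}%, CI {overall[1]:.1f}%")
```

The three set sizes sum to 105, matching the problem count stated above. Pooling weights each problem equally, so the Calculation-Free set, the largest of the three, dominates the combined percentage.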
Abstract
This report describes a test of the large language model GPT-4 with the Wolfram Alpha and Code Interpreter plug-ins on 105 original problems in science and math, at the high school and college levels, carried out in June-August 2023. Our tests suggest that the plug-ins significantly enhance GPT's ability to solve these problems. However, "interface" failures remain common; that is, GPT often has trouble formulating problems in a way that elicits useful answers from the plug-ins. Fixing these interface failures seems like a central challenge in making GPT a reliable tool for college-level calculation problems.
