Can generative AI and ChatGPT outperform humans on cognitive-demanding problem-solving tasks in science?

Xiaoming Zhai, Matthew Nyaaba, Wenchao Ma

TL;DR

Can generative AI tools outperform humans on cognitively demanding science tasks? The authors compare ChatGPT and GPT-4 with students on 2019 NAEP science items across grades 4, 8, and 12, coding items with a two-dimensional cognitive load framework and scoring AI outputs with the NAEP scoring keys. They find that both AI models consistently outperform most students on individual items, with high agreement between the two models. Items with higher cognitive demand require higher average student ability scores, whereas AI performance is generally less sensitive to cognitive demand, especially at grades 8 and 12. The results have implications for educational objectives and assessment design, suggesting a shift toward higher-order thinking, creativity, and AI-literate evaluation practices, while highlighting ethical considerations and the need for teacher preparation. The study also notes limitations due to data access and calls for further research with more granular data and varied modalities.

Abstract

This study examined the assumption that generative artificial intelligence (GAI) tools can overcome the cognitive intensity that humans experience when solving problems. We compared the performance of ChatGPT and GPT-4 with that of students on the 2019 NAEP science assessments, by the cognitive demand of the items. Fifty-four tasks were coded by experts using a two-dimensional cognitive load framework comprising task cognitive complexity and dimensionality. ChatGPT and GPT-4 responses were scored using the NAEP scoring keys. The analysis of the available data was based on the average student ability scores for students who answered each item correctly and the percentage of students who responded to individual items. Results showed that both ChatGPT and GPT-4 consistently outperformed most students who took the NAEP science assessments. As the cognitive demand of NAEP tasks increases, statistically significantly higher average student ability scores are required to answer the questions correctly. This pattern was observed for students in each of grades 4, 8, and 12. However, the performance of ChatGPT and GPT-4 was not statistically sensitive to the increase in the cognitive demand of the tasks, except for Grade 4. As the first study to compare GAI and K-12 students in science problem-solving, this finding implies the need for changes to educational objectives so that students are prepared to work competently with GAI tools in the future. Education ought to emphasize the cultivation of advanced cognitive skills rather than depending solely on tasks that demand cognitive intensity. This approach would foster critical thinking, analytical skills, and the application of knowledge in novel contexts. Findings also suggest the need for innovative assessment practices that move away from cognitively intensive tasks toward creativity and analytical skills, in order to avoid the negative effects of GAI on testing more efficiently.

Paper Structure (13 sections, 2 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Explain how to produce sounds (NAEP, Science, Grade: 4, Year: 2019)
  • Figure 2: Percentage of Grade 4 students who scored below ChatGPT or GPT-4 on each item
  • Figure 3: Percentage of Grade 8 students who scored below ChatGPT or GPT-4 on each item
  • Figure 4: Percentage of Grade 12 students who scored below ChatGPT or GPT-4 on each item