Can OpenAI o1 outperform humans in higher-order cognitive thinking?

Ehsan Latif, Yifan Zhou, Shuchen Guo, Lehong Shi, Yizhu Gao, Matthew Nyaaba, Arne Bewerdorff, Xiantong Yang, Xiaoming Zhai

TL;DR

This study benchmarks OpenAI's o1-preview across seven higher-order thinking domains (critical, systematic, computational thinking, data literacy, creative thinking, logical reasoning, and scientific reasoning) against human performance using established instruments. Across six of seven domains, o1-preview surpasses human benchmarks, notably in systematic thinking, data literacy, creative thinking, logical reasoning, and scientific reasoning, with near-perfect performance on TOSLS and perfect scores on ATTA for algorithmic thinking. The model excels in structured, well-defined tasks but shows limitations in unstructured problem-solving and adaptive reasoning, indicating it should complement rather than replace human cognition in education. The findings highlight both the educational potential and the need for ethical oversight, better assessment design, and ongoing refinement to ensure safe, equitable, and holistic AI-supported learning outcomes.

Abstract

This study evaluates the performance of OpenAI's o1-preview model in higher-order cognitive domains, including critical thinking, systematic thinking, computational thinking, data literacy, creative thinking, logical reasoning, and scientific reasoning. Using established benchmarks, we compared the o1-preview model's performance to that of human participants from diverse educational levels. o1-preview achieved a mean score of 24.33 on the Ennis-Weir Critical Thinking Essay Test (EWCTET), surpassing undergraduate (13.8) and postgraduate (18.39) participants (z = 1.60 and 0.90, respectively). In systematic thinking, it scored 46.1 (SD = 4.12) on the Lake Urmia Vignette, significantly outperforming the human mean (20.08, SD = 8.13, z = 3.20). For data literacy, o1-preview scored 8.60 (SD = 0.70) on Merk et al.'s "Use Data" dimension, compared to the human post-test mean of 4.17 (SD = 2.02, z = 2.19). On creative thinking tasks, the model achieved originality scores of 2.98 (SD = 0.73), higher than the human mean of 1.74 (z = 0.71). In logical reasoning (LogiQA), it outperformed humans with an average accuracy of 90% (SD = 10%) versus 86% (SD = 6.5%, z = 0.62). For scientific reasoning, it achieved near-perfect performance (mean = 0.99, SD = 0.12) on the TOSLS, exceeding the highest human scores of 0.85 (SD = 0.13, z = 1.78). While o1-preview excelled in structured tasks, it showed limitations in problem-solving and adaptive reasoning. These results demonstrate the potential of AI to complement education in structured assessments but highlight the need for ethical oversight and refinement for broader applications.
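The abstract does not state how the z-scores were computed. As a minimal sketch, assuming each z is the gap between the model mean and the human mean divided by the human standard deviation, the following reproduces the values reported for systematic thinking, data literacy, and logical reasoning. The function name `z_score` and the `reported` dictionary are illustrative, not taken from the paper, and the formula is an assumption that other domains may not follow.

```python
def z_score(model_mean: float, human_mean: float, human_sd: float) -> float:
    """Standardized gap between model and human means (assumed formula)."""
    return (model_mean - human_mean) / human_sd


# (model mean, human mean, human SD) as given in the abstract.
reported = {
    "systematic thinking": (46.1, 20.08, 8.13),
    "data literacy": (8.60, 4.17, 2.02),
    "logical reasoning": (90.0, 86.0, 6.5),
}

for domain, (model_mean, human_mean, human_sd) in reported.items():
    z = z_score(model_mean, human_mean, human_sd)
    print(f"{domain}: z = {z:.2f}")
```

Under this assumption, a z of about 3.2 in systematic thinking places the model mean roughly three human standard deviations above the human mean, while the logical reasoning gap of about 0.6 is well within one standard deviation.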

Paper Structure

This paper contains 45 sections, 1 figure, and 11 tables.

Figures (1)

  • Figure 1: Performance overview of OpenAI o1-preview in higher-order thinking domains compared to human experts.