Does GPT Really Get It? A Hierarchical Scale to Quantify Human vs AI's Understanding of Algorithms

Mirabel Reid, Santosh S. Vempala

TL;DR

This work defines a precise, testable notion of algorithm understanding grounded in the Understanding as Representation Manipulability (URM) framework and introduces a hierarchical five-level scale to measure depth of understanding. It empirically validates the scale by comparing human learners (undergraduate and graduate students) with multiple generations of GPT on two canonical algorithms: the Euclidean algorithm for GCD and the Ford-Fulkerson max-flow method. The results show GPT-4 achieving functional, near-graduate-level understanding of these algorithms, while displaying a stronger bias toward language reasoning than mathematical reasoning, though code-generation tasks are exceptional across models. The study provides a principled, extensible method to track AI progress in algorithmic understanding and highlights challenges such as hallucinations, hedging, and the limits of current LLMs in robust mathematical reasoning, with implications for AI-assisted teaching and software development.

Abstract

As Large Language Models (LLMs) perform (and sometimes excel at) more and more complex cognitive tasks, a natural question is whether AI really understands. The study of understanding in LLMs is in its infancy, and the community has yet to incorporate well-trodden research in philosophy, psychology, and education. We initiate this, specifically focusing on understanding algorithms, and propose a hierarchy of levels of understanding. We use the hierarchy to design and conduct a study with human subjects (undergraduate and graduate students) as well as large language models (generations of GPT), revealing interesting similarities and differences. We expect that our rigorous criteria will be useful to keep track of AI's progress in such cognitive domains.

Paper Structure (36 sections, 10 figures)

This paper contains 36 sections, 10 figures.

Figures (10)

  • Figure 1: A hierarchy of understanding.
  • Figure 2: The average scores across students who self-reported that they understood the algorithm. Number of records is $n=13$ (undergraduate) and $n=10$ (graduate) respectively. The average scores for GPT-4 are across 60 randomized versions of the surveys. Error bars are 95% confidence intervals.
  • Figure 3: The average score between three versions of GPT, across 30 random surveys for each of GCD and Max Flow. Error bars show the 95% confidence interval.
  • Figure 4: The distribution of scores per question for GPT-4.
  • Figure 5: The difference in mean performance between mathematical and natural language reasoning tasks on Ford-Fulkerson (left) and the Euclidean algorithm (right). The top graphs show tasks at Level 4, while the bottom graphs show tasks at Level 5.
  • ...and 5 more figures

Theorems & Definitions (7)

  • Definition: Level 1: Execution
  • Definition: Level 2: Step-By-Step Evaluation
  • Definition: Level 3: Representation
  • Definition: Level 4a: Exemplification
  • Definition: Level 4b: Explanation
  • Definition: Level 5a: Extrapolation
  • Definition: Level 5b: Counterfactual Reasoning
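
To make the two lowest levels concrete, here is a minimal Python sketch of the Euclidean algorithm, one of the two algorithms used in the study. It rests on an assumption not spelled out in this outline: that Level 1 (Execution) roughly corresponds to producing the correct output for a given input, and Level 2 (Step-By-Step Evaluation) to reproducing the intermediate states along the way. The function name `gcd_trace` and the sample inputs are hypothetical and are not taken from the paper's surveys.

```python
def gcd_trace(a: int, b: int) -> tuple[int, list[tuple[int, int]]]:
    """Run the Euclidean algorithm on non-negative integers.

    Returns the GCD together with every (a, b) state visited, so the
    final answer and the intermediate steps can be checked separately.
    """
    states = [(a, b)]
    while b != 0:
        # Core step: replace (a, b) with (b, a mod b).
        a, b = b, a % b
        states.append((a, b))
    return a, states


# Illustrative inputs (an assumption, not an item from the study).
result, states = gcd_trace(48, 18)
print(result)  # final output only: 6
print(states)  # full trace: [(48, 18), (18, 12), (12, 6), (6, 0)]
```

Under this reading, a respondent who reports only `6` has shown the kind of behavior Level 1 names, while one who can also write out the full list of states has shown the behavior Level 2 names. The higher levels in the list concern representation, examples, explanation, and reasoning about variations, which a single trace like this does not capture.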