Does GPT Really Get It? A Hierarchical Scale to Quantify Human vs AI's Understanding of Algorithms

Mirabel Reid; Santosh S. Vempala

TL;DR

This work defines a precise, testable notion of algorithm understanding grounded in the Understanding as Representation Manipulability (URM) framework and introduces a hierarchical five-level scale to measure depth of understanding. It empirically validates the scale by comparing human learners (undergraduate and graduate students) with multiple generations of GPT on two canonical algorithms: the Euclidean algorithm for GCD and the Ford-Fulkerson max-flow method. The results show GPT-4 achieving functional, near-graduate-level understanding of these algorithms, while displaying a bias toward language reasoning over mathematical reasoning, though code-generation tasks are exceptional across models. The study provides a principled, extensible method to track AI progress in algorithmic understanding and highlights challenges such as hallucinations, hedging, and the limits of current LLMs in robust mathematical reasoning, with implications for AI-assisted teaching and software development.
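For readers unfamiliar with the first of the two algorithms, the following is a minimal Python sketch of the Euclidean algorithm that also records each intermediate pair. It is an illustration only, not code from the study: the function name `gcd_trace` and the choice to return the full list of steps are assumptions made here, loosely mirroring the distinction between computing a result and evaluating the algorithm step by step.

```python
from typing import List, Tuple


def gcd_trace(a: int, b: int) -> Tuple[int, List[Tuple[int, int]]]:
    """Return gcd(a, b) together with every (a, b) pair visited.

    Hypothetical helper for illustration; not part of the paper.
    """
    a, b = abs(a), abs(b)          # the GCD is defined on magnitudes
    steps = [(a, b)]               # record the starting pair
    while b != 0:
        a, b = b, a % b            # core step: replace (a, b) with (b, a mod b)
        steps.append((a, b))
    return a, steps                # when b reaches 0, a holds the GCD


if __name__ == "__main__":
    result, steps = gcd_trace(48, 18)
    print("gcd:", result)
    for pair in steps:
        print(pair)
```

Running the sketch on 48 and 18 visits the pairs (48, 18), (18, 12), (12, 6), (6, 0) and returns 6. Returning the trace alongside the result is a design choice for this example: it makes the intermediate states checkable, which a bare final answer does not.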

Abstract

As Large Language Models (LLMs) perform (and sometimes excel at) more and more complex cognitive tasks, a natural question is whether AI really understands. The study of understanding in LLMs is in its infancy, and the community has yet to incorporate well-trodden research in philosophy, psychology, and education. We initiate this, specifically focusing on understanding algorithms, and propose a hierarchy of levels of understanding. We use the hierarchy to design and conduct a study with human subjects (undergraduate and graduate students) as well as large language models (generations of GPT), revealing interesting similarities and differences. We expect that our rigorous criteria will be useful to keep track of AI's progress in such cognitive domains.

Paper Structure (36 sections, 10 figures)

Introduction
Motivation: Why Study Algorithm Understanding?
Related Work
Cognitive Abilities of LLMs.
Understanding in LLMs.
Theories of Understanding.
A Definition of Understanding
Preliminaries
Internal Representations
Levels of Understanding
Hypotheses
Methods
Experimental Design
Human Survey
LLM Experiments
...and 21 more sections

Figures (10)

Figure 1: A hierarchy of understanding.
Figure 2: The average scores across students who self-reported that they understood the algorithm. Number of records is $n=13$ (undergraduate) and $n=10$ (graduate) respectively. The average scores for GPT-4 are across 60 randomized versions of the surveys. Error bars are 95% confidence intervals.
Figure 3: The average score between three versions of GPT, across 30 random surveys for each of GCD and Max Flow. Error bars show the 95% confidence interval.
Figure 4: The distribution of scores per question for GPT-4.
Figure 5: The difference in mean performance between mathematical and natural language reasoning tasks on Ford-Fulkerson (left) and the Euclidean algorithm (right). The top graphs show tasks at Level 4, while the bottom graphs show tasks at Level 5.
...and 5 more figures

Theorems & Definitions (7)

Definition: Level 1: Execution
Definition: Level 2: Step-By-Step Evaluation
Definition: Level 3: Representation
Definition: Level 4a: Exemplification
Definition: Level 4b: Explanation
Definition: Level 5a: Extrapolation
Definition: Level 5b: Counterfactual Reasoning
