Does GPT Really Get It? A Hierarchical Scale to Quantify Human vs AI's Understanding of Algorithms
Mirabel Reid, Santosh S. Vempala
TL;DR
This work defines a precise, testable notion of algorithm understanding grounded in the Understanding as Representation Manipulability (URM) framework and introduces a hierarchical five-level scale to measure depth of understanding. It empirically validates the scale by comparing human learners (undergraduate and graduate students) with multiple generations of GPT on two canonical algorithms: the Euclidean algorithm for GCD and the Ford-Fulkerson max-flow method. The results show GPT-4 achieving functional, near-graduate-level understanding of these algorithms while leaning more on language-based reasoning than on mathematical reasoning, though code-generation tasks are exceptional across models. The study provides a principled, extensible method for tracking AI progress in algorithmic understanding and highlights challenges such as hallucinations, hedging, and the limits of current LLMs in robust mathematical reasoning, with implications for AI-assisted teaching and software development.
Abstract
As Large Language Models (LLMs) perform (and sometimes excel at) more and more complex cognitive tasks, a natural question is whether AI really understands. The study of understanding in LLMs is in its infancy, and the community has yet to incorporate well-trodden research in philosophy, psychology, and education. We initiate this, specifically focusing on understanding algorithms, and propose a hierarchy of levels of understanding. We use the hierarchy to design and conduct a study with human subjects (undergraduate and graduate students) as well as large language models (generations of GPT), revealing interesting similarities and differences. We expect that our rigorous criteria will be useful to keep track of AI's progress in such cognitive domains.
