Large Language Models are Algorithmically Blind

Sohan Venkatesh; Ashish Mahendran Kurapath; Tejas Melkote

TL;DR

This work evaluates eight frontier LLMs on predicting the performance of causal discovery algorithms, using ground truth derived from large-scale algorithm executions. It finds systematic, near-total failure, reflecting a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.

Abstract

Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. Using causal discovery as a testbed, we evaluate eight frontier LLMs against ground truth derived from large-scale algorithm executions and find systematic, near-total failure. Models produce ranges far wider than the true confidence intervals, yet these ranges still fail to contain the true algorithmic mean in the majority of instances. Most models perform worse than random guessing, and the marginal above-random performance of the best model is more consistent with benchmark memorization than with principled reasoning. We term this failure algorithmic blindness and argue that it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.
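
To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a percentile bootstrap for the 95% confidence interval of a metric's mean across runs, and it assumes an instance counts as covered when the true mean lies inside the model's predicted range. The function names `bootstrap_mean_ci` and `coverage`, the number of resamples, and the seed are all illustrative choices; the paper's own aggregation and coverage rules may differ.

```python
import random
import statistics


def bootstrap_mean_ci(samples, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of per-run metric values.

    Assumption: a plain percentile bootstrap; n_boot and seed are arbitrary.
    """
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def coverage(predicted_ranges, true_means):
    """Fraction of instances whose true mean falls inside the predicted range.

    Assumption: closed intervals and an unweighted average over instances.
    """
    hits = sum(lo <= m <= hi for (lo, hi), m in zip(predicted_ranges, true_means))
    return hits / len(true_means)


if __name__ == "__main__":
    # Hypothetical F1 scores from repeated runs of one algorithm on one dataset.
    f1_runs = [0.42, 0.47, 0.39, 0.45, 0.44, 0.41, 0.48, 0.43]
    print(bootstrap_mean_ci(f1_runs))
    # Hypothetical LLM-predicted ranges for two instances and their true means.
    print(coverage([(0.6, 0.9), (0.3, 0.5)], [0.44, 0.41]))
```

Under these assumptions, a wide predicted range such as (0.6, 0.9) still counts as a miss when the true mean is 0.44, which is the pattern the abstract describes: width alone does not buy coverage.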

Paper Structure (41 sections, 9 equations, 8 figures, 15 tables)

Introduction
Related Work
Methodology
Ground Truth Computation
Datasets.
Algorithms.
Metrics.
LLM Query Protocol
Aggregation and Coverage Computation
Baselines
Prompt Robustness and Algorithm-Specific Degradation Analysis
Memorization Probes
Results
Systematic Calibration Failure
Most LLMs Fail to Exceed Random Guessing
...and 26 more sections

Figures (8)

Figure 1: Comparison of LLM estimates and algorithmic ground truth revealing algorithmic blindness.
Figure 2: Methodology overview. LLMs are prompted with dataset characteristics and algorithmic assumptions to predict performance metric ranges (Precision, Recall, F1, SHD). Ground-truth metrics with bootstrap 95% confidence intervals are computed via large-scale executions and calibration is evaluated using interval coverage.
Figure 3: Mean calibrated coverage on benchmark datasets versus synthetic datasets across all models.
Figure 4: Coverage collapse across synthetic networks.
Figure 5: Predicted range width on benchmark versus synthetic datasets.
...and 3 more figures
