Large Language Models are Algorithmically Blind

Sohan Venkatesh, Ashish Mahendran Kurapath, Tejas Melkote

TL;DR

This work evaluates eight frontier LLMs against ground truth derived from large-scale algorithm executions and finds systematic, near-total failure, which reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.

Abstract

Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. Using causal discovery as a testbed, we evaluate eight frontier LLMs against ground truth derived from large-scale algorithm executions and find systematic, near-total failure. Models produce ranges far wider than the true confidence intervals, yet these ranges still fail to contain the true algorithmic mean in the majority of instances. Most models perform worse than random guessing, and the marginal above-random performance of the best model is most consistent with benchmark memorization rather than principled reasoning. We term this failure algorithmic blindness and argue that it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.

Paper Structure (41 sections, 9 equations, 8 figures, 15 tables)

Figures (8)

  • Figure 1: Comparison of LLM estimates and algorithmic ground truth revealing algorithmic blindness.
  • Figure 2: Methodology overview. LLMs are prompted with dataset characteristics and algorithmic assumptions to predict performance metric ranges (Precision, Recall, F1, SHD). Ground-truth metrics with bootstrap 95% confidence intervals are computed via large-scale executions and calibration is evaluated using interval coverage.
  • Figure 3: Mean calibrated coverage on benchmark datasets versus synthetic datasets across all models.
  • Figure 4: Coverage collapse across synthetic networks.
  • Figure 5: Predicted range width: benchmark vs. synthetic datasets.
  • ...and 3 more figures
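
The evaluation described in the Figure 2 caption (bootstrap 95% confidence intervals for ground truth, plus interval coverage for calibration) can be illustrated with a minimal sketch. This is not the authors' code: the function names `bootstrap_ci` and `coverage`, the percentile bootstrap variant, the resample count, and the example scores are all assumptions made for illustration. Coverage here is taken to mean the fraction of predicted ranges that contain the true mean, following the abstract's wording.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Assumption: a plain percentile bootstrap; the paper may use a different variant.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def coverage(predicted_ranges, true_means):
    """Fraction of predicted (low, high) ranges that contain the true mean."""
    hits = sum(lo <= m <= hi for (lo, hi), m in zip(predicted_ranges, true_means))
    return hits / len(true_means)


# Hypothetical F1 scores from repeated runs of one algorithm on one dataset.
f1_runs = [0.61, 0.58, 0.64, 0.60, 0.59, 0.63, 0.62, 0.57]
ci_low, ci_high = bootstrap_ci(f1_runs)

# Hypothetical LLM-predicted ranges for two settings, scored against true means.
score = coverage([(0.0, 1.0), (0.5, 0.6)], [0.5, 0.9])
```

In this toy setup, a predicted range can be much wider than the bootstrap interval and still miss the true mean, which is the pattern the abstract reports.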