Executing Natural Language-Described Algorithms with Large Language Models: An Investigation
Xin Zheng, Qiming Zhu, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun
TL;DR
The paper evaluates whether modern LLMs can execute algorithms described in natural language by constructing a CLRS-based test suite (CLRS-mini and CLRS-Numeric) and a rigorous prompting framework that enforces stepwise execution. Results show that GPT-4 attains near-perfect performance on algorithms with clear control flow and light arithmetic, while numerically intensive tasks remain out of reach without external tooling; GPT-3.5 variants underperform, especially on graph-heavy and recursion-heavy tasks, due to context-length limits. The study also analyzes intermediate results, ablations, and data-leakage concerns, concluding that detailed natural-language prompts strongly improve the faithfulness of program execution. Overall, the work demonstrates that LLMs can mimic von Neumann-like execution for many algorithms, providing a foundation for further exploration of computation with LLMs and for developing robust benchmarks.
Abstract
Executing computer programs described in natural language has long been a pursuit of computer science. With the advent of enhanced natural language understanding capabilities exhibited by large language models (LLMs), the path toward this goal has been illuminated. In this paper, we seek to examine the capacity of present-day LLMs to comprehend and execute algorithms outlined in natural language. We established an algorithm test set sourced from Introduction to Algorithms, a well-known textbook that contains many representative, widely used algorithms. To systematically assess LLMs' code execution abilities, we selected 30 algorithms, generated 300 randomly sampled instances in total, and evaluated whether popular LLMs can understand and execute these algorithms. Our findings reveal that LLMs, notably GPT-4, can effectively execute programs described in natural language, as long as no heavy numeric computation is involved. We believe our findings contribute to evaluating LLMs' code execution abilities and will encourage further investigation and application of the computational power of LLMs.
