Empirical Computation

Eric Tang, Marcel Böhme

TL;DR

The paper defines empirical computation as solving problems by prompting models to produce the most likely solution rather than a guaranteed-correct one, and argues that formal theories of computation are not sufficient for analyzing such systems. Through preliminary experiments on sorting, searching, and related tasks using an LLM-based tool, it shows that while execution time can be largely independent of problem complexity (e.g., the $O(n\log n)$ bound for sorting does not constrain prompt-based solutions), correctness declines as problem size grows and depends on input representation and familiarity. The authors present a framework of questions and methods for assessing the properties of empirical computation, including notions of statistical guarantees, problem encodings, and methods to predict or improve correctness, and argue for new software engineering tools and techniques to study trustworthiness in AI-driven computation.

Abstract

In this vision paper, we explore the challenges and opportunities of a form of computation that employs an empirical (rather than a formal) approach, where the solution of a computational problem is returned as empirically most likely (rather than necessarily correct). We call this approach *empirical computation* and observe that its capabilities and limits *cannot* be understood within the classic, rationalist framework of computation.

While we take a very broad view of "computational problem", a classic, well-studied example is *sorting*: Given a set of $n$ numbers, return these numbers sorted in ascending order.

  • To run a classical, *formal computation*, we might first think about a *specific algorithm* (e.g., merge sort) before developing a *specific* program that implements it. The program will expect the input to be given in a *specific* format, type, or data structure (e.g., unsigned 32-bit integers). In software engineering, we have many approaches to analyze the correctness of such programs. From complexity theory, we know that there exists no correct program that can solve the average instance of the sorting problem faster than $O(n\log n)$.
  • To run an *empirical computation*, we might directly ask a large language model (LLM) to solve *any* computational problem (which can be stated informally in natural language) and provide the input in *any* format (e.g., negative numbers written as Chinese characters). There is no (problem-specific) program that could be analyzed for correctness. Also, the time it takes an LLM to return an answer is entirely *independent* of the computational complexity of the problem that is solved.

What are the capabilities or limits of empirical computation in general, for a specific problem, or for a specific instance? Our purpose is to establish empirical computation as a field in SE that is timely and rich with interesting problems.
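Since an empirical computation offers no problem-specific program to analyze, correctness has to be judged on each returned answer. For the sorting example, such a check might look like the sketch below. It assumes the model's reply has already been parsed into a list of numbers, and `is_correct_sort` is a hypothetical helper name, not something the paper defines.

```python
from collections import Counter


def is_correct_sort(original, candidate):
    """Return True if candidate holds exactly original's elements in ascending order."""
    # The answer must be a permutation of the input: nothing dropped, added, or altered.
    same_elements = Counter(original) == Counter(candidate)
    # Every adjacent pair must be in non-decreasing order.
    ascending = all(a <= b for a, b in zip(candidate, candidate[1:]))
    return same_elements and ascending


print(is_correct_sort([3, 1, 2], [1, 2, 3]))  # True
print(is_correct_sort([3, 1, 2], [1, 2]))     # False: an element was dropped
print(is_correct_sort([3, 1, 2], [1, 3, 2]))  # False: not in ascending order
```

Checking both conditions matters here: a model can return a perfectly ordered list that silently omits or invents elements, which an order-only check would accept.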

Paper Structure

This paper contains 10 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: LLM performance on various well-studied computational problems. Top: Average time to solution. Middle: Average proportion of correct solutions (among 30+ repetitions). Bottom: Expected time to generate the first correct solution.
  • Figure 2: Correctness of empirical sorting for various languages (measured as the proportion of correctly sorted lists).
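The three quantities named in the Figure 1 caption can be illustrated with a short sketch. The paper's exact estimator is not given in this summary, so the code below makes an explicit assumption: repetitions are independent, so the number of attempts until the first correct answer is geometric with mean $1/p$, and the expected time to the first correct solution is the average time divided by the proportion correct. The function name `summarize_trials` and the sample data are hypothetical.

```python
from statistics import mean


def summarize_trials(trials):
    """Summarize repeated runs of one prompt.

    trials: list of (seconds, is_correct) pairs, one pair per repetition.
    Returns (average time, proportion correct, expected time to first correct).
    """
    avg_time = mean(seconds for seconds, _ in trials)
    p_correct = sum(1 for _, ok in trials if ok) / len(trials)
    # Assumption: independent repetitions, so attempts until the first
    # success are geometric with mean 1 / p_correct.
    expected_first = avg_time / p_correct if p_correct > 0 else float("inf")
    return avg_time, p_correct, expected_first


# Made-up measurements for illustration only.
sample = [(2.0, True), (4.0, False), (3.0, True), (3.0, False)]
print(summarize_trials(sample))  # (3.0, 0.5, 6.0)
```

Under this assumption a fast but unreliable answer is not necessarily cheap: halving the proportion of correct answers doubles the expected time to obtain one.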