ProgramBench: Can Language Models Rebuild Programs From Scratch?

John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press

Abstract

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
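
To make the evaluation protocol concrete, the sketch below shows a minimal differential behavioral test: the agent-built binary is run on the same inputs as the reference executable and the two are compared on exit code and stdout. The binary paths (`./reference`, `./candidate`) and the fixed example input are hypothetical placeholders for illustration, not ProgramBench's actual harness.

```python
import subprocess

def run(executable, args, stdin_data):
    """Run an executable and capture its observable behavior."""
    proc = subprocess.run(
        [executable, *args],
        input=stdin_data,
        capture_output=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout

def behavioral_test(reference, candidate, args, stdin_data):
    """Pass iff the candidate matches the reference on exit code and stdout."""
    ref_code, ref_out = run(reference, args, stdin_data)
    cand_code, cand_out = run(candidate, args, stdin_data)
    return ref_code == cand_code and ref_out == cand_out

# Hypothetical usage: compare the agent-built binary against the reference
# on one input (a fuzzer would generate many such inputs).
if __name__ == "__main__":
    ok = behavioral_test("./reference", "./candidate",
                         args=["--count", "3"], stdin_data=b"hello\nworld\n")
    print("PASS" if ok else "FAIL")
```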

Paper Structure

This paper contains 34 sections, 32 figures, and 11 tables.

Figures (32)

  • Figure 1: ProgramBench evaluates models on their ability to write software projects from scratch. Given a software program (e.g., executable) and its documentation, a software engineering agent (SWE-agent) is tasked with producing source code and a build script that reconstructs the original program's behavior.
  • Figure 2: ProgramBench task collection pipeline. To turn a GitHub repository into a ProgramBench task, we use a SWE-agent to compile an executable, generate behavioral tests, and strip away implementation details. The sourcing workflow only requires a repository to produce an executable or program, making it extensible to many codebases.
  • Figure 2: Main results on ProgramBench. % Resolved is the primary metric: the fraction of 200 tasks where all tests pass. % Almost relaxes this to instances with ≥95% of tests passing. We also report average API calls and cost per task. (A short sketch of these two metrics follows the figure list.)
  • Figure 3: Distribution of programming languages across ProgramBench task instances. To solve the task, models may write their solution in any language they choose.
  • Figure 4: Cumulative distribution of test pass rates across models on ProgramBench.
  • ...and 27 more figures
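
The two headline metrics referenced in the main-results caption above can be stated as a short sketch. Assuming each task yields a per-task test pass rate in [0, 1], % Resolved counts tasks where every test passes and % Almost counts tasks at or above a 0.95 pass rate; the pass-rate values below are hypothetical.

```python
def resolved_rate(pass_rates):
    """Fraction of tasks where all behavioral tests pass (% Resolved)."""
    return sum(r == 1.0 for r in pass_rates) / len(pass_rates)

def almost_rate(pass_rates, threshold=0.95):
    """Fraction of tasks with at least `threshold` of tests passing (% Almost)."""
    return sum(r >= threshold for r in pass_rates) / len(pass_rates)

# Hypothetical per-task pass rates for one model on five tasks.
rates = [1.0, 0.97, 0.40, 0.95, 0.10]
print(f"% Resolved: {100 * resolved_rate(rates):.1f}")
print(f"% Almost:   {100 * almost_rate(rates):.1f}")
```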