CodeClash: Benchmarking Goal-Oriented Software Engineering

John Yang, Kilian Lieret, Joyce Yang, Carlos E. Jimenez, Ofir Press, Ludwig Schmidt, Diyi Yang

TL;DR

CodeClash tackles the deficit in benchmarks for long-horizon, goal-driven software development by hosting multi-round tournaments where LM-driven codebases compete in diverse arenas. The framework uses log-based feedback, fixed edit budgets, and head-to-head competition to study strategic development, code organization, and maintenance, revealing creative but often unstable approaches and a persistent gap to expert humans. Across 1,680 tournaments, models show diverse styles yet struggle with interpreting competitive feedback and validating changes, prompting future work on improved reasoning and maintainable code practices. The authors provide an open-source toolkit and leaderboard to facilitate ongoing research into autonomous, goal-oriented code evolution.

Abstract

Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Whether it's writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases both absolutely and against their opponents. We run 1,680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.
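
The two-phase round structure described above (edit, then compete, with competition logs copied back into each codebase) can be sketched as a simple loop. The Python below is a minimal illustration under stated assumptions, not the CodeClash toolkit's API: `run_tournament`, `make_editor`, `toy_arena`, and the `logs/round_N.log` path are hypothetical names, and the toy arena merely compares an integer stored in `main.py`. The default of 15 rounds follows from the abstract's totals (25,200 rounds / 1,680 tournaments = 15).

```python
from typing import Callable, Dict, List, Tuple

Codebase = Dict[str, str]  # file path -> file contents (hypothetical representation)
EditFn = Callable[[Codebase, int], None]
ArenaFn = Callable[[Dict[str, Codebase]], Tuple[str, str]]


def run_tournament(
    codebases: Dict[str, Codebase],
    edit_fns: Dict[str, EditFn],
    arena: ArenaFn,
    num_rounds: int = 15,
) -> List[str]:
    """Run the edit/compete loop and return the winner of each round."""
    round_winners: List[str] = []
    for rnd in range(1, num_rounds + 1):
        # Edit phase: each player modifies its own codebase.
        for name, edit in edit_fns.items():
            edit(codebases[name], rnd)
        # Competition phase: the arena pits the codebases against each other.
        winner, log = arena(codebases)
        round_winners.append(winner)
        # Feedback: the competition log is copied back into every codebase.
        for codebase in codebases.values():
            codebase[f"logs/round_{rnd}.log"] = log
    return round_winners


def make_editor(step: int) -> EditFn:
    """Toy stand-in for an LM agent: bumps an integer 'strength' in main.py."""
    def edit(codebase: Codebase, rnd: int) -> None:
        strength = int(codebase.get("main.py", "0"))
        codebase["main.py"] = str(strength + step)
    return edit


def toy_arena(codebases: Dict[str, Codebase]) -> Tuple[str, str]:
    """Toy arena: the codebase with the larger integer in main.py wins."""
    scores = {name: int(cb["main.py"]) for name, cb in codebases.items()}
    winner = max(scores, key=scores.get)
    return winner, f"scores={scores}"


codebases = {"alice": {"main.py": "0"}, "bob": {"main.py": "0"}}
edit_fns = {"alice": make_editor(2), "bob": make_editor(1)}
winners = run_tournament(codebases, edit_fns, toy_arena, num_rounds=3)
print(winners)
```

In the benchmark itself, the edit step is an LM agent deciding what to change and the arena is one of the six code arenas; the sketch only fixes the order of operations and the fact that logs are the feedback channel between rounds.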

Paper Structure

This paper contains 24 sections, 23 figures, 9 tables.

Figures (23)

  • Figure 1: CodeClash is a benchmark where players (LMs as SWE-agents) compete in programming tournaments spanning multiple rounds. Per round, models edit their codebases (edit phase) before the codebases face off in a code arena (competition phase). Then, the competition logs are copied back into the codebases and the next round begins.
  • Figure 2: Model win rates (row beats column). Win rate is the proportion of tournaments (out of 240) won across all arenas. Claude Sonnet 4.5 has the highest average win rate at 69.9%.
  • Figure 3: Win rates across rounds, illustrating how different models gain (Claude Sonnet 4.5) or lose momentum (GPT-5) over the course of the tournament.
  • Figure 4: Probability of winning the next round after losing several rounds in a row. Even the highest-ranking models struggle to recover after losing one or more consecutive rounds in a tournament. Numbers in parentheses indicate the overall average win rate.
  • Figure 5: To measure solution diversity, we compute code similarity of each model's solutions to itself at the same round. Each data point represents the mean pairwise similarity between a model's solution (main.py) at round n across 70 BattleSnake tournaments.
  • ...and 18 more figures
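
The recovery statistic in Figure 4 (probability of winning the next round after several consecutive losses) can be estimated from per-round outcomes along the following lines. This is a minimal sketch, not the authors' analysis code: `comeback_rates` is a hypothetical name, each tournament is assumed to be a list of booleans for one player (True = round won), and a round preceded by a losing streak of length s is counted for every k up to s, that is, "lost at least the previous k rounds".

```python
from collections import defaultdict
from typing import Dict, List


def comeback_rates(
    tournaments: List[List[bool]], max_streak: int = 3
) -> Dict[int, float]:
    """Estimate P(win round t | lost the k rounds before t) for k = 1..max_streak."""
    wins: Dict[int, int] = defaultdict(int)
    trials: Dict[int, int] = defaultdict(int)
    for rounds in tournaments:
        streak = 0  # consecutive losses immediately before the current round
        for won in rounds:
            # This round counts as a trial for every streak length it follows.
            for k in range(1, min(streak, max_streak) + 1):
                trials[k] += 1
                wins[k] += int(won)
            streak = 0 if won else streak + 1
    return {k: wins[k] / trials[k] for k in sorted(trials)}


# One toy tournament for a single player: L, L, W, L, W (made-up data).
rates = comeback_rates([[False, False, True, False, True]])
print(rates)
```

Comparing each rate against the player's overall win rate (the number shown in parentheses in the figure) indicates whether losing streaks make recovery less likely than baseline.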