Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

Panayiotis Danassis, Naman Goel

TL;DR

This paper introduces the Auction, Pickup, and Delivery Problem (APDP), a real-world, multi-agent benchmark that requires LLM-generated code to bid strategically in a market and solve constrained vehicle-routing tasks to maximize profit. It systematically compares 40 LLM-coded agents against 17 human-coded agents across 12 double all-play-all tournaments on four topology graphs, revealing that graduate-student–coded agents consistently outperform the LLM-based solutions. Notably, the majority of LLM agents are beaten by simple baselines, and even the best LLM can degrade a strong human solution when asked to improve it, indicating gaps in reasoning and planning capabilities beyond code generation. The findings argue for new evaluations that stress reasoning-driven code synthesis in open-world, multi-agent settings and advocate open-source, open benchmarks to push the development of more capable, robust code-generation systems.

Abstract

The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation, but this development has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent, reasoning-driven benchmark based on a real-world logistics optimization problem (the Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (produced by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and $\sim 40$k matches demonstrate that (i) agents coded by humans (graduate students) are clearly superior: the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as input and prompted to improve upon it, the best-performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.

Paper Structure

This paper contains 26 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Traditional benchmarks (top) focus on problems with clearly defined correct or incorrect solutions, typically verified through unit tests. In contrast, our benchmark (bottom) involves complex tasks such as planning, constraint optimization, modeling competitors, competitive strategy design, and advanced algorithm development — challenges that remain highly non-trivial even for experienced software engineers. Top from jainlivecodebench (left) and huynh2025large (right).
  • Figure 2: The Auction, Pickup, and Delivery Problem (APDP). Task: Logistic operations optimization. Goal: Maximize profit for a transportation company. Multiple transportation companies (agents) compete in a market. Each company owns several vehicles that deliver tasks (e.g., parcels) in a given network. Tasks are sold via a reverse first-price sealed-bid auction, i.e., a company's bid corresponds to the amount of money they want to be paid to deliver the task. Higher bids mean more revenue, but bidding too high may result in not getting the auctioned task. A competitive bid depends on (i) the marginal cost of adding the auctioned task to the partial delivery plan, given the already won tasks, (ii) the marginal cost of the opponent, and (iii) other strategic decisions like incurring a loss (bid below your marginal cost) at the beginning in order to reduce the cost of future tasks (better positioning in the market). After the auction is complete, the company has to determine a plan for its vehicles such that all tasks won by the company are delivered and the total revenue of the company is maximized. The plan is a sequence of pickup and delivery actions, such that vehicle capacity constraints are satisfied. The total revenue of the company is defined as the sum of rewards (won bids paid out by the auction house) minus delivery cost (kilometers driven times cost per kilometer). It is, thus, necessary to bid optimally and compute efficient delivery plans.
  • Figure 3: Network Topologies. From top to bottom and left to right we have: Great Britain, Switzerland, the Netherlands, and France. Colored triangles represent vehicles.
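
The auction and payoff rules described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the benchmark's actual code or API: the names `Bid`, `run_reverse_first_price_auction`, `marginal_cost`, and `company_revenue`, as well as the tie-breaking rule (the first submitted lowest bid wins), are assumptions made here for illustration. The sketch only encodes what the caption states: the lowest bid wins and is paid its own bid, the marginal cost is the extra delivery cost of adding the auctioned task to the current plan, and the company's total revenue is the sum of won bids minus kilometers driven times cost per kilometer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    """Hypothetical container: what a company asks to be paid for a task."""
    company: str
    amount: float


def run_reverse_first_price_auction(bids):
    """Reverse first-price sealed-bid auction.

    The lowest bid wins and the winner is paid exactly its own bid.
    Tie-breaking (first lowest bid in the list wins) is an assumption.
    """
    if not bids:
        raise ValueError("at least one bid is required")
    winning = min(bids, key=lambda b: b.amount)
    return winning.company, winning.amount


def marginal_cost(plan_cost_with_task, plan_cost_without_task):
    """Extra delivery cost of adding the auctioned task to the current plan."""
    return plan_cost_with_task - plan_cost_without_task


def company_revenue(won_bids, km_driven, cost_per_km):
    """Sum of rewards (won bids) minus delivery cost (km times cost per km)."""
    return sum(won_bids) - km_driven * cost_per_km


if __name__ == "__main__":
    # Company A bids lower than company B, so A wins and is paid 120.
    winner, payment = run_reverse_first_price_auction(
        [Bid("A", 120.0), Bid("B", 150.0)]
    )
    print(winner, payment)

    # Adding the task raises A's plan cost from 100 to 150.
    print(marginal_cost(150.0, 100.0))

    # Two won tasks (120 and 90), 30 km driven at 5 per km.
    print(company_revenue([120.0, 90.0], km_driven=30.0, cost_per_km=5.0))
```

In this example company A's bid of 120 exceeds its marginal cost of 50, so winning the task is profitable on its own. The caption also notes that an agent may deliberately bid below its marginal cost early on to improve its position for future tasks, which this sketch does not model.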