Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning
Panayiotis Danassis, Naman Goel
TL;DR
This paper introduces the Auction, Pickup, and Delivery Problem (APDP), a real-world, multi-agent benchmark that requires LLM-generated code to bid strategically in a market and solve constrained vehicle-routing tasks to maximize profit. It systematically compares 40 LLM-coded agents against 17 human-coded agents across 12 double all-play-all tournaments on four graph topologies, and finds that agents coded by graduate students consistently outperform the LLM-based solutions. Notably, the majority of LLM agents are beaten by simple baselines, and even the best LLM degrades a strong human solution when asked to improve it, which points to gaps in reasoning and planning that go beyond code generation. The findings argue for new evaluations that stress reasoning-driven code synthesis in open-world, multi-agent settings, and for open-source, openly available benchmarks that push the development of more capable, robust code-generation systems.
Abstract
The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation, and this development has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent, reasoning-driven benchmark based on a real-world logistics optimization problem (the Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that (i) bid strategically under uncertainty and (ii) optimize delivery plans for tasks while maximizing profit. We evaluate 40 LLM-coded agents (produced by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and $\sim 40$k matches demonstrate that (i) human-coded agents, written by graduate students, are clearly superior: the top 5 spots are consistently won by human-coded agents; (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines; and (iii) when given the best human solution as input and prompted to improve upon it, the best-performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
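
As a rough sanity check on the scale quoted above, the sketch below enumerates a double all-play-all schedule, read here as "every pair of agents meets twice", and counts the resulting matches. It assumes that exactly the 57 agents named in the abstract (40 LLM-coded, 17 human-coded) take part in each tournament; baseline agents, if they also compete, would raise the count. The agent names and the helper `double_all_play_all` are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def double_all_play_all(agents):
    """Return every ordered pair (a, b) with a != b.

    Each unordered pair appears twice, once as (a, b) and once as (b, a),
    which is one way to model a double all-play-all schedule.
    """
    return list(permutations(agents, 2))


# Hypothetical agent labels; only the counts (40 and 17) come from the abstract.
agents = [f"llm_{i}" for i in range(40)] + [f"human_{i}" for i in range(17)]

schedule = double_all_play_all(agents)
matches_per_tournament = len(schedule)        # n * (n - 1) for n agents
total_matches = 12 * matches_per_tournament   # 12 tournaments, per the abstract

print(matches_per_tournament, total_matches)
```

Under these assumptions each tournament has 57 × 56 = 3,192 matches, or 38,304 over 12 tournaments, which is in line with the reported $\sim 40$k.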
