Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Hung Tran; Langston Nashold; Rayan Krishnan; Antoine Bigeard; Alex Gu

TL;DR

This paper presents a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and an evaluator alignment protocol with both cross-model and human annotation results.

Abstract

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves only 58.0% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
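
To make the "pairwise step-level agreement" figure concrete, the following is a minimal sketch of one plausible way to compute it. It assumes agreement means the fraction of substeps on which two evaluators return the same pass/fail verdict; the paper's exact definition is not given in this excerpt. The function names, evaluator names, and toy verdicts are hypothetical and are not taken from the benchmark.

```python
from itertools import combinations


def step_agreement(a, b):
    """Fraction of substeps on which two evaluators give the same verdict."""
    if len(a) != len(b):
        raise ValueError("evaluators must judge the same substeps")
    if not a:
        raise ValueError("no substeps to compare")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_agreement(verdicts):
    """Map each unordered evaluator pair to its step-level agreement."""
    return {
        (m, n): step_agreement(verdicts[m], verdicts[n])
        for m, n in combinations(sorted(verdicts), 2)
    }


# Hypothetical toy data: True = substep judged as passed.
toy_verdicts = {
    "evaluator_a": [True, True, False, True],
    "evaluator_b": [True, False, False, True],
    "evaluator_c": [False, False, True, False],
}

scores = pairwise_agreement(toy_verdicts)
for pair, score in scores.items():
    print(pair, f"{score:.1%}")
```

Under this assumed definition, a wide spread across pairs (as in the toy data) indicates that the choice of evaluator can change which substeps count as passed.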

Paper Structure (65 sections, 1 equation, 6 figures, 12 tables)

Introduction
Motivation
The gap.
Overview of Vibe Code Bench.
Key findings.
Contributions.
Related Work
Code Generation Benchmarks.
Web Agent Benchmarks.
Agentic Coding Systems.
Benchmark Design
Data Construction
Task Format and Test Structure.
Third-Party Service Integration.
Tests
...and 50 more sections

Figures (6)

Figure 1: Generation flow from natural-language specification to a runnable application artifact.
Figure 2: Automated evaluation flow from deployed app to workflow pass/fail scoring.
Figure 3: Accuracy–cost and accuracy–latency trade-offs.
Figure 4: Application pass-rate histograms for six representative models (ranks 1, 4, 7, 10, 13, 16).
Figure 5: Trajectory timeline by model on a single application (bill_splitting_app).
...and 1 more figure
