From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems

Marcos Ortiz, Justin Hill, Collin Overbay, Ingrida Semenec, Frederic Sauve-Hoover, Jim Schwoebel, Joel Shor

TL;DR

The paper tackles the challenge of evaluating agentic prompt-to-app systems by introducing a human-centered benchmark that combines automated checks with task-based human evaluation. It systematically compares Replit, Bolt, and Firebase Studio across 96 prompts, generating 288 artifacts and engaging 205 participants in both isolated and side-by-side assessments. The results reveal a clear hierarchy in head-to-head testing, with Firebase Studio outperforming competitors across ease of use, trust, and visual quality, while isolated assessments understate these differences. The work provides a publicly available benchmark framework, prompt set, and generated artifacts to enable reproducible evaluation and guide future research in agentic application generation.

Abstract

Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt-to-app") represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper, we introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms: Replit, Bolt, and Firebase Studio. Using a diverse set of 96 prompts spanning common web application tasks, we generate 288 unique application artifacts. We evaluate these systems through a large-scale human-rater study involving 205 participants and 1,071 quality-filtered pairwise comparisons, assessing task-based ease of use, visual appeal, perceived completeness, and user trust. Our results show that these systems are not interchangeable: Firebase Studio consistently outperforms competing platforms across all human-evaluated dimensions, achieving the highest win rates for ease of use, trust, visual appeal, and visual appropriateness. Bolt performs competitively on visual appeal but trails Firebase on usability and trust, while Replit underperforms relative to both across most metrics. These findings highlight a persistent gap between visual polish and functional reliability in prompt-to-app systems and demonstrate the necessity of interactive, task-based evaluation. We release our benchmark framework, prompt set, and generated artifacts to support reproducible evaluation and future research in agentic application generation.

Paper Structure

This paper contains 65 sections, 1 equation, 17 figures, 13 tables.

Figures (17)

  • Figure 1: To evaluate prompt-to-app systems, we curated a list of prompts and used each system to generate App Artifacts. Those artifacts were then deployed, when possible, as live web applications. Human raters were then asked to evaluate individual apps and to make side-by-side comparisons between apps generated by different systems from the same prompt.
  • Figure 2: Screenshots from the survey. (Left) Isolated app evaluation (the example is from the tutorial). (Center) Questions in the isolated evaluation portion. (Right) Comparison evaluation: presentation and specific questions.
  • Figure 3: Combined Linear Mixed-Effects Model scores for Clarity and Ease metrics across all platforms. Scores represent ratings on a 1--5 scale, adjusted for participant bias, prompt difficulty, and position effects. Point plots with error bars show 95% confidence intervals.
  • Figure 4: Platform performance differences from baseline (Bolt) measured using two independent statistical methodologies: Bradley-Terry tournament rankings (left) and Linear Mixed-Effects Models (right). Pairwise comparison results show statistically significant differences in nearly all platform-dimension combinations. In the isolated rating stage, a significant difference is detected for only one platform-dimension combination. This suggests that the pairwise comparison methodology is more sensitive to user preferences than the isolated rating methodology. Error bars show 95% confidence intervals. Asterisk (*) indicates statistical significance ($p < 0.05$).
  • Figure 5: Effect sizes measured using Cliff's Delta between platform pairs across comparison metrics. Effect size interpretation: $|\Delta| < 0.147$ (negligible), $0.147 \leq |\Delta| < 0.33$ (small), $0.33 \leq |\Delta| < 0.474$ (medium), $|\Delta| \geq 0.474$ (large). Color intensity indicates effect magnitude.
  • ...and 12 more figures
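
The Cliff's Delta thresholds quoted in the Figure 5 caption can be illustrated with a short sketch. This is not the authors' analysis code: the function names `cliffs_delta` and `effect_size_label` and the two rating lists are hypothetical, chosen only to show how the statistic and the caption's magnitude bands fit together. Cliff's Delta is the share of cross-sample pairs where the first sample is larger minus the share where it is smaller.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Return Cliff's delta: P(x > y) - P(x < y) over all cross pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


def effect_size_label(delta):
    """Map |delta| to the bands listed in the Figure 5 caption."""
    magnitude = abs(delta)
    if magnitude < 0.147:
        return "negligible"
    if magnitude < 0.33:
        return "small"
    if magnitude < 0.474:
        return "medium"
    return "large"


# Hypothetical 1-5 ratings for two platforms (illustrative, not study data).
ratings_a = [4, 5, 3, 4, 5]
ratings_b = [3, 4, 2, 3, 4]

delta = cliffs_delta(ratings_a, ratings_b)
label = effect_size_label(delta)
print(f"delta={delta:.3f} ({label})")
```

Because the statistic depends only on the ordering of ratings, it suits ordinal scales such as 1-5 survey responses, where differences between scale points need not be equal.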