app.build: A Production Framework for Scaling Agentic Prompt-to-App Generation with Environment Scaffolding
Evgenii Kniazev, Arseny Kravchenko, Igor Rekun, James Broadhead, Nikita Shamgunov, Pranav Sah, Pratik Nichite, Ivan Yamshchikov
TL;DR
app.build introduces Environment Scaffolding (ES), an environment-first framework for production-grade, LLM-powered app generation. By coupling structured task decomposition with multi-layered, per-step validation and runtime isolation, ES supports model-agnostic repair loops that improve reliability and allow cost-aware model selection. Across 300 automated experiments and 30 human evaluations on 30 prompts, comprehensive validation achieves high viability ($V=1$ in 73.3% of cases) and meaningful quality (mean $Q$ of about 8.8 for viable apps), and open-weight models offer substantial cost reductions relative to closed models (e.g., \$0.61 per viable app). Production deployment since 2025 shows real-world adoption (thousands of apps generated), supporting the claim that environment design, not just model capability, drives production reliability. The work provides complete reference implementations to support reproducibility and practical adoption in production-oriented agent systems.
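The per-step validation and repair loop described above can be illustrated with a minimal sketch. It is not the app.build implementation: every name here (`run_step`, `first_error`, `syntax_check`, `stub_generate`) is hypothetical, the "model" is a stub function, and the only validation layer is a Python syntax check. The point is only to show the control flow, in which validator feedback is fed back to whichever generator is plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical types for illustration; not taken from the app.build codebase.
Validator = Callable[[str], Optional[str]]       # returns an error message, or None if OK
Generator = Callable[[str, Optional[str]], str]  # (task, feedback) -> candidate artifact


@dataclass
class StepResult:
    artifact: str
    attempts: int
    passed: bool


def first_error(artifact: str, validators: List[Validator]) -> Optional[str]:
    """Run validation layers in order and return the first failure, if any."""
    for validate in validators:
        error = validate(artifact)
        if error is not None:
            return error
    return None


def run_step(task: str, generate: Generator, validators: List[Validator],
             max_attempts: int = 3) -> StepResult:
    """Generate one step's artifact, re-prompting with validator feedback on failure."""
    feedback: Optional[str] = None
    artifact = ""
    for attempt in range(1, max_attempts + 1):
        artifact = generate(task, feedback)
        feedback = first_error(artifact, validators)
        if feedback is None:
            return StepResult(artifact, attempt, True)
    return StepResult(artifact, max_attempts, False)


def syntax_check(source: str) -> Optional[str]:
    """Example validation layer: does the candidate compile as Python?"""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    return None


def stub_generate(task: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM call: emits broken code first, fixed code after feedback."""
    if feedback is None:
        return "def add(a, b) return a + b"
    return "def add(a, b):\n    return a + b"


result = run_step("write an add function", stub_generate, [syntax_check])
```

Because `run_step` depends only on the `Generator` signature, swapping the stub for a different model leaves the loop unchanged, which is the sense in which such a loop is model-agnostic.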
Abstract
We present app.build (https://github.com/neondatabase/appdotbuild-agent), an open-source framework that improves LLM-based application generation through systematic validation and structured environments. Our approach combines multi-layered validation pipelines, stack-specific orchestration, and a model-agnostic architecture, implemented across three reference stacks. In an evaluation on 30 generation tasks, comprehensive validation achieves a 73.3% viability rate, with 30% reaching perfect quality scores, while open-weight models achieve 80.8% of closed-model performance when provided with structured environments. The open-source framework has been adopted by the community, with over 3,000 applications generated to date. This work demonstrates that scaling reliable AI agents requires scaling environments, not just models, and it provides empirical insights and complete reference implementations for production-oriented agent systems.
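As a reading aid for the headline numbers, the sketch below shows how a viability rate and a cost per viable app could be computed from per-task outcomes. The function names and the cost definition (total spend divided by the number of viable apps) are assumptions made for illustration, not definitions taken from this work. The only figure tied to the text is the arithmetic that 22 viable apps out of 30 tasks gives 73.3%; the total cost passed in is a made-up input.

```python
from typing import Sequence


def viability_rate(viable: Sequence[bool]) -> float:
    """Fraction of tasks whose generated app was judged viable (V = 1)."""
    if not viable:
        raise ValueError("no tasks to evaluate")
    return sum(viable) / len(viable)


def cost_per_viable_app(total_cost: float, viable: Sequence[bool]) -> float:
    """Assumed definition: total generation spend divided by the count of viable apps."""
    n_viable = sum(viable)
    if n_viable == 0:
        raise ValueError("no viable apps; cost per viable app is undefined")
    return total_cost / n_viable


# 22 viable outcomes out of 30 tasks, matching a 73.3% viability rate.
outcomes = [True] * 22 + [False] * 8
rate_percent = round(viability_rate(outcomes) * 100, 1)

# Hypothetical total spend, chosen only to exercise the function.
example_cost = cost_per_viable_app(11.0, outcomes)
```

Dividing by viable apps rather than by all attempts charges the cost of failed generations to the successful ones, so a cheaper model with a lower viability rate is not automatically cheaper per usable result.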
