Automatically Generating Web Applications from Requirements Via Multi-Agent Test-Driven Development
Yuxuan Wan, Tingshuo Liang, Jiakai Xu, Jingyu Xiao, Yintong Huo, Michael R. Lyu
TL;DR
TDDev is the first test-driven development (TDD)-enabled multi-agent framework for automatic end-to-end full-stack web development from natural language descriptions or design images. It assembles three specialized agents (test-case generation, development, and testing) into a cohesive Req-to-App workflow that iteratively refines implementations based on executable tests and user-simulation feedback. Experimental results on the multimodal Req-to-App-MM benchmark show that TDDev achieves higher accuracy, lower failure rates, and competitive visual fidelity compared with baselines, while substantially reducing manual developer effort. This work demonstrates the practical potential of integrating TDD with LLM agents to automate the development of reliable, visually faithful full-stack web applications at scale, with clear directions for further improvement in open-source model support and the scalability of feedback cycles.
Abstract
Developing full-stack web applications is complex and time-intensive, demanding proficiency across diverse technologies and frameworks. Although recent advances in multimodal large language models (MLLMs) enable automated webpage generation from visual inputs, current solutions remain limited to front-end tasks and fail to deliver fully functional applications. In this work, we introduce TDDev, the first test-driven development (TDD)-enabled LLM-agent framework for end-to-end full-stack web application generation. Given a natural language description or design image, TDDev automatically derives executable test cases, generates front-end and back-end code, simulates user interactions, and iteratively refines the implementation until all requirements are satisfied. Our framework addresses key challenges in full-stack automation, including underspecified user requirements, complex interdependencies among multiple files, and the need for both functional correctness and visual fidelity. Through extensive experiments on diverse application scenarios, TDDev achieves a 14.4% improvement in overall accuracy over state-of-the-art baselines, demonstrating its effectiveness in producing reliable, high-quality web applications without requiring manual intervention.
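The generate-test-refine cycle described in the abstract can be illustrated with a minimal Python sketch. This is not TDDev's implementation: the names `AcceptanceTest`, `generate_tests`, `develop`, `run_tests`, and `tdd_loop` are hypothetical, and the three agents are deterministic stubs standing in for LLM calls and browser-based user simulation. The sketch only shows the control flow, in which failing tests are fed back to the development step until every test passes or a round budget is exhausted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Files = Dict[str, str]  # path -> source text of a generated file


@dataclass
class AcceptanceTest:
    """Hypothetical executable test derived from a requirement."""
    name: str
    check: Callable[[Files], bool]


def generate_tests(requirement: str) -> List[AcceptanceTest]:
    # Stub test-case generation agent (assumption: a real one would query an LLM).
    return [
        AcceptanceTest("has_frontend", lambda files: "index.html" in files),
        AcceptanceTest("has_backend", lambda files: "server.py" in files),
    ]


def develop(requirement: str, files: Files, failures: List[str]) -> Files:
    # Stub development agent: writes a front end first, then repairs
    # whatever the previous test round reported as failing.
    updated = dict(files)
    updated.setdefault("index.html", f"<h1>{requirement}</h1>")
    if "has_backend" in failures:
        updated["server.py"] = "# placeholder back end"
    return updated


def run_tests(tests: List[AcceptanceTest], files: Files) -> List[str]:
    # Stub testing agent: returns the names of the failing tests.
    return [t.name for t in tests if not t.check(files)]


def tdd_loop(requirement: str, max_rounds: int = 5) -> Tuple[Files, int, List[str]]:
    """Refine the generated files until all tests pass or the budget runs out."""
    tests = generate_tests(requirement)
    files: Files = {}
    failures: List[str] = []
    for round_no in range(1, max_rounds + 1):
        files = develop(requirement, files, failures)
        failures = run_tests(tests, files)
        if not failures:
            return files, round_no, failures
    return files, max_rounds, failures


files, rounds, failures = tdd_loop("Todo list app")
print(rounds, sorted(files), failures)
```

With these stubs, the first round produces only the front end, the failing back-end test is reported as feedback, and the second round repairs it. In a real system, the stubs would be replaced by model-backed agents and the checks by executable end-to-end tests.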
