Feature-Driven End-To-End Test Generation
Parsa Alian, Noor Nashid, Mobina Shahbandeh, Taha Shabani, Ali Mesbah
TL;DR
This work formalizes feature-driven end-to-end (E2E) testing and introduces AutoE2E, an LLM-based framework that automatically infers application features from observed states, constructs feature-centric action sequences, and converts them into executable E2E tests. It couples a probabilistic feature-inference model with a breadth-first search (BFS) exploration loop and two interlinked databases (FD and AFD) that aggregate evidence and update feature confidence; chain-of-thought (CoT) prompts and rank-based scoring determine the final feature ranking. To evaluate the approach, the authors create E2EBench, a benchmark of eight open-source web apps with ground-truth feature grammars that enables automatic measurement of feature coverage. AutoE2E achieves an average feature coverage of 79% on E2EBench, substantially outperforming baselines such as Crawljax, WebCanvas, and BrowserGym, which demonstrates that feature-driven E2E test generation for web applications is feasible and effective. The work also discusses limitations (one-to-one feature-test mapping, assertion strategies) and outlines ways to extend the framework to broader platforms and richer coverage, supported by the public release of AutoE2E and E2EBench.
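The exploration-and-scoring loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy application graph `APP`, the canned `RANKED_GUESSES` table standing in for LLM output, the `infer_features` helper, and the reciprocal-rank scoring rule are all assumptions made for illustration, and a single dictionary stands in for the FD and AFD databases.

```python
from collections import defaultdict, deque

# Hypothetical toy app: state -> {action label: next state}.
APP = {
    "home": {"click 'Products'": "products", "click 'Login'": "login"},
    "products": {"click 'Add to cart'": "cart"},
    "login": {},
    "cart": {"click 'Checkout'": "checkout"},
    "checkout": {},
}

# Stand-in for LLM output: ranked candidate features per state, best first.
RANKED_GUESSES = {
    "home": ["browse products", "log in"],
    "products": ["add item to cart", "browse products"],
    "login": ["log in"],
    "cart": ["check out", "add item to cart"],
    "checkout": ["check out"],
}


def infer_features(state):
    """Return ranked candidate features for a state (assumed helper)."""
    return RANKED_GUESSES.get(state, [])


def explore(start):
    """Breadth-first exploration that aggregates rank-based feature scores."""
    scores = defaultdict(float)  # simplified stand-in for the feature database
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        state = queue.popleft()
        order.append(state)
        # Assumed scoring rule: a feature at rank r contributes 1/r.
        for rank, feature in enumerate(infer_features(state), start=1):
            scores[feature] += 1.0 / rank
        for next_state in APP[state].values():
            if next_state not in visited:
                visited.add(next_state)
                queue.append(next_state)
    return order, dict(scores)


order, scores = explore("home")
# Highest aggregated score first; ties broken alphabetically.
ranking = sorted(scores, key=lambda f: (-scores[f], f))
print(order)
print(ranking)
```

In this toy run, a feature that is suggested at the top rank in several states accumulates the highest score, which mirrors the idea of gathering evidence across states before committing to a feature.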
Abstract
End-to-end (E2E) testing is essential for ensuring web application quality. However, manual test creation is time-consuming, and current test generation techniques produce incoherent tests. In this paper, we present AutoE2E, a novel approach that leverages Large Language Models (LLMs) to automate the generation of semantically meaningful feature-driven E2E test cases for web applications. AutoE2E intelligently infers potential features within a web application and translates them into executable test scenarios. Furthermore, we address a critical gap in the research community by introducing E2EBench, a new benchmark for automatically assessing the feature coverage of E2E test suites. Our evaluation on E2EBench demonstrates that AutoE2E achieves an average feature coverage of 79%, outperforming the best baseline by 558%, highlighting its effectiveness in generating high-quality, comprehensive test cases.
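As a rough illustration of the metric, the sketch below computes feature coverage as the fraction of ground-truth features exercised by a test suite. Treating features as plain sets of names is a simplifying assumption (E2EBench describes them with feature grammars), and the function names and example feature labels are hypothetical. The second helper shows the arithmetic behind the reported improvement, assuming "by 558%" means a relative increase over the best baseline; the resulting baseline value is implied by that reading, not a figure quoted from the paper.

```python
def feature_coverage(covered, ground_truth):
    """Fraction of ground-truth features hit by at least one test."""
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground_truth must contain at least one feature")
    return len(set(covered) & truth) / len(truth)


def implied_baseline(score_pct, relative_gain_pct):
    """Baseline score implied by a relative gain: score / (1 + gain)."""
    return score_pct / (1 + relative_gain_pct / 100)


# Hypothetical feature names, for illustration only.
truth = ["log in", "add item to cart", "check out", "search", "edit profile"]
covered = ["log in", "check out", "search", "unknown feature"]
print(f"coverage: {feature_coverage(covered, truth):.0%}")

# 79% average coverage with a 558% relative gain implies a baseline near 12%.
print(f"implied best baseline: {implied_baseline(79, 558):.1f}%")
```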
