Acceptance Test Generation with Large Language Models: An Industrial Case Study

Margarida Ferreira, Luis Viegas, Joao Pascoal Faria, Bruno Lima

TL;DR

The paper investigates using large language models to automate acceptance testing in an industrial setting by decoupling the process into two steps: AutoUAT generates Gherkin acceptance scenarios from user stories, and Test Flow converts these scenarios into executable Cypress scripts. Evaluated in a real-world automotive context, AutoUAT's generated scenarios were judged helpful 95% of the time. Test Flow produced Cypress scripts that were usable as generated in 60% of cases; counting scripts that needed minor fixes (8%) or regeneration with additional inputs (24%), 92% were considered helpful. The work presents a practical pipeline that supports ATDD/BDD with human-in-the-loop oversight, integrating with existing workflows via Azure-hosted models, Power Platform, JIRA, and CI/CD. The findings suggest that LLM-assisted acceptance testing can reduce developer workload and improve test quality, though broader studies are needed to assess long-term productivity, cost, and applicability across domains and architectures. The study contributes AutoUAT and Test Flow as industrially integrated tools and provides empirical insights into the benefits and challenges of AI-assisted acceptance testing.

Abstract

Large language model (LLM)-powered assistants are increasingly used for generating program code and unit tests, but their application in acceptance testing remains underexplored. To help address this gap, this paper explores the use of LLMs for generating executable acceptance tests for web applications through a two-step process: (i) generating acceptance test scenarios in natural language (in Gherkin) from user stories, and (ii) converting these scenarios into executable test scripts (in Cypress), given the HTML code of the pages under test. This two-step approach supports acceptance test-driven development, enhances tester control, and improves test quality. The two steps were implemented in the AutoUAT and Test Flow tools, respectively, both powered by GPT-4 Turbo. The tools were integrated into a partner company's workflow and evaluated on real-world projects. Users found the acceptance test scenarios generated by AutoUAT helpful 95% of the time, and the scenarios even revealed previously overlooked cases. Of the acceptance test cases generated by Test Flow, 92% were considered helpful: 60% were usable as generated, 8% required minor fixes, and 24% needed to be regenerated with additional inputs; the remaining 8% were discarded due to major issues. These results suggest that LLMs can, in fact, help improve the acceptance test process with appropriate tooling and supervision.
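
To make the intermediate artifact of the two-step process concrete, the following is a minimal sketch of what a Gherkin scenario looks like and how its steps can be read out before being mapped to test commands. The scenario text, the `Step` type, and the `parseSteps` helper are all hypothetical illustrations; they are not taken from the paper and do not describe how AutoUAT or Test Flow are implemented.

```typescript
// Hypothetical Gherkin scenario of the kind step (i) might produce from a user story.
// The wording is invented for illustration only.
type StepKeyword = "Given" | "When" | "Then";

interface Step {
  keyword: StepKeyword;
  text: string;
}

const scenario = `
Scenario: Successful login
  Given the user is on the login page
  When the user submits valid credentials
  And the user confirms the two-factor prompt
  Then the dashboard is displayed
`;

// Split a scenario into steps, resolving "And"/"But" to the preceding keyword.
function parseSteps(gherkin: string): Step[] {
  const steps: Step[] = [];
  let last: StepKeyword | null = null;
  for (const raw of gherkin.split("\n")) {
    const line = raw.trim();
    const match = /^(Given|When|Then|And|But)\s+(.*)$/.exec(line);
    if (!match) continue; // skip blank lines and the "Scenario:" header
    const word = match[1];
    const keyword: StepKeyword | null =
      word === "And" || word === "But" ? last : (word as StepKeyword);
    if (keyword === null) {
      throw new Error(`"${word}" step has no preceding Given/When/Then`);
    }
    steps.push({ keyword, text: match[2] });
    last = keyword;
  }
  return steps;
}

const steps = parseSteps(scenario);
console.log(steps.map((s) => `${s.keyword}: ${s.text}`).join("\n"));
```

In a pipeline like the one described, each `Given`/`When`/`Then` step would then need to be turned into concrete page interactions and assertions in step (ii), which is why the abstract notes that the HTML of the pages under test is required as input.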

Paper Structure

This paper contains 29 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Acceptance Test Generation with AutoUAT and Test Flow.
  • Figure 2: Flow Diagram of Test Flow.
  • Figure 3: GitHub Action for Integrating Test Flow into Current Workflows.
  • Figure 4: Workshop Participants' Feedback on AutoUAT Adoption and Benefits.
  • Figure 5: Procedure for evaluating and classifying the test cases generated with Test Flow. The arrows are labelled with the number of test cases that adhered to the corresponding path in our experiment.
  • ...and 2 more figures