On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language

Sébastien Salva, Redha Taguelmimt

TL;DR

The paper evaluates the execution of natural-language GUI test cases by LLM agents, highlighting their inherent unsoundness and the potential inconsistency of repeated runs. It introduces a guardrail-based execution algorithm that augments test cases with internal readiness and observe steps and employs specialized agents for navigation, readiness, and assertions, formalized with IOLTS and the ioco relation. Three measures assess LLM capabilities (navigation, readiness, assertion), a consistency metric estimates stability across runs, and a definition of weak unsoundness tied to Six Sigma quality levels sets the tolerance for acceptable execution. Experiments with eight local LLMs show that large models (notably Llama 3.1 70B) can achieve high execution consistency, while many models still struggle with navigation and precise assertion evaluation, underscoring the need for improved tooling and prompts. The work provides prototype tools, test suites, and results, along with guidance on soundness, consistency, and future research directions.
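The guardrail idea summarized above can be pictured with a minimal Python sketch. It is not the paper's algorithm: the `Agents` container, the `execute_test_case` function, the polling limit, and the three verdict strings are hypothetical names chosen for illustration, and the agents are plain callables standing in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Agents:
    """Hypothetical stand-ins for the specialized LLM agents."""
    navigate: Callable[[str], bool]     # performs one navigation action, True on success
    is_ready: Callable[[], bool]        # readiness guardrail: can the GUI accept an action?
    observe: Callable[[], str]          # returns a textual description of the GUI state
    check: Callable[[str, str], bool]   # evaluates one assertion against an observation


def execute_test_case(actions: List[str], assertions: List[str],
                      agents: Agents, max_ready_polls: int = 3) -> str:
    """Run navigation actions, then assertions; return 'pass', 'fail' or 'inconclusive'.

    The verdict names and the polling limit are assumptions of this sketch.
    """
    observation = agents.observe()
    for action in actions:
        # Injected readiness step: poll until the GUI is ready or give up.
        if not any(agents.is_ready() for _ in range(max_ready_polls)):
            return "inconclusive"
        # Navigation step, delegated to the navigation agent.
        if not agents.navigate(action):
            return "inconclusive"
        # Injected observe step: capture the state reached by the action.
        observation = agents.observe()
    # Assertions are evaluated one by one on the last observation.
    for assertion in assertions:
        if not agents.check(assertion, observation):
            return "fail"
    return "pass"
```

In this sketch a failed guardrail yields an inconclusive verdict rather than a failure, so that an agent error during navigation is not reported as a defect of the application under test. Whether the paper makes the same choice is not stated in this summary.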

Abstract

The use of natural language (NL) test cases for validating graphical user interface (GUI) applications is emerging as a promising alternative to manually written executable test scripts, which are costly to develop and difficult to maintain. Recent advances in large language models (LLMs) have opened the possibility of the direct execution of NL test cases by LLM agents. This paper investigates this direction, focusing on the impact on NL test case unsoundness and on test case execution consistency. NL test cases are inherently unsound, as they may yield false failures due to ambiguous instructions or unpredictable agent behaviour. Furthermore, repeated executions of the same NL test case may lead to inconsistent outcomes, undermining test reliability. To address these challenges, we propose an algorithm for executing NL test cases with guardrail mechanisms and specialised agents that dynamically verify the correct execution of each test step. We introduce measures to evaluate the capabilities of LLMs in test execution and one measure to quantify execution consistency. We propose a definition of weak unsoundness to characterise contexts in which NL test case execution remains acceptable with respect to the Six Sigma industrial quality levels. Our experimental evaluation with eight publicly available LLMs, ranging from 3B to 70B parameters, demonstrates both the potential and the current limitations of LLM agents for GUI testing. Our experiments show that Meta Llama 3.1 70B demonstrates acceptable capabilities in NL test case execution with high execution consistency (above the 3-sigma level). We provide prototype tools, test suites, and results.
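To make the link between repeated runs and a sigma level concrete, here is a small Python sketch. The paper's own consistency measure and its mapping to Six Sigma levels are not given in this summary, so both functions are assumptions: `consistency` is the share of runs that agree with the most frequent verdict, and `sigma_level` uses the conventional Six Sigma conversion with a 1.5-sigma shift.

```python
from collections import Counter
from statistics import NormalDist
from typing import Sequence


def consistency(verdicts: Sequence[str]) -> float:
    """Share of runs agreeing with the most frequent verdict (assumed measure)."""
    if not verdicts:
        raise ValueError("at least one run is required")
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return most_common_count / len(verdicts)


def sigma_level(yield_rate: float, shift: float = 1.5) -> float:
    """Conventional Six Sigma level for a yield strictly between 0 and 1.

    The 1.5-sigma shift is the usual industrial convention, assumed here;
    NormalDist.inv_cdf raises StatisticsError for a yield of exactly 0 or 1.
    """
    return NormalDist().inv_cdf(yield_rate) + shift


if __name__ == "__main__":
    runs = ["pass"] * 19 + ["fail"]   # hypothetical batch of 20 runs
    c = consistency(runs)
    print(f"consistency={c:.2f}, sigma level={sigma_level(c):.2f}")
```

Under these assumptions, a batch in which 19 of 20 runs agree sits slightly above the 3-sigma level, whereas a fully consistent batch cannot be converted directly because the inverse normal CDF is undefined at a yield of 1.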

Paper Structure

This paper contains 17 sections, 2 theorems, 8 equations, 10 figures, 3 tables, 1 algorithm.

Key Result

Proposition 2

Let $tc= a_1 \dots a_k \mathcal{A}_{k+1} \dots \mathcal{A}_l$ be a NL test case such that there exists $?a_1!g_1 \dots ?a_k!g_k \in traces(S)$, and $\mathcal{A}_{j} (k+1\leq j\leq l)$ are true on $g_k$, and $tc$ is made up of atomic actions and assertions. If $(\sigma(agent_{nav})<3\sigma$, $\sigma(

Figures (10)

  • Figure 1: Scenario to search for ARTEMIS project news.
  • Figure 2: Step-by-step test case to verify the presence of links containing the term ‘ARTEMIS’.
  • Figure 3: Automated Selenium test implementing the test case of Figure 2.
  • Figure 4: Illustration of the test case execution algorithm. (a) Internal actions (readiness and observe) are injected around navigation actions. (b) Assertions are incrementally evaluated to determine the final verdict.
  • Figure 5: Ability of LLM agents to perform readiness actions, navigation actions, and assertions, measured as mean accuracies with TestG and TestA over a batch of 20 runs (Part 1).
  • ...and 5 more figures

Theorems & Definitions (6)

  • Definition 1: Weak Unsoundness
  • Proposition 2
  • Proposition 3
  • Definition 4
  • Definition 5: $ioco$ implementation relation
  • Definition 6: Weak Unsoundness