Table of Contents
Fetching ...

Validating Formal Specifications with LLM-generated Test Cases

Alcino Cunha, Nuno Macedo

TL;DR

This paper tackles validating formal Alloy specifications by automatically generating executable test cases from natural-language requirements using large language models. It empirically evaluates GPT-5 (and other models) with three prompt styles on four Alloy4Fun domain models, measuring syntactic correctness, executability, and ability to detect incorrect specifications. Results show GPT-5 with a few-shot prompt yields high-quality, diverse test suites that are largely syntactically valid and effective at exposing wrong specifications, with non-determinism having limited impact and other models achieving competitive but weaker performance. The work demonstrates practical benefits for validation in formal methods and education, and points to future enhancements such as syntax repair post-processing and extending the approach to temporal or behavioral specifications.

Abstract

Validation is a central activity when developing formal specifications. Similarly to coding, a possible validation technique is to define upfront test cases or scenarios that a future specification should satisfy or not. Unfortunately, specifying such test cases is burdensome and error prone, which could cause users to skip this validation task. This paper reports the results of an empirical evaluation of using pre-trained large language models (LLMs) to automate the generation of test cases from natural language requirements. In particular, we focus on test cases for structural requirements of simple domain models formalized in the Alloy specification language. Our evaluation focuses on the state-of-art GPT-5 model, but results from other closed- and open-source LLMs are also reported. The results show that, in this context, GPT-5 is already quite effective at generating positive (and negative) test cases that are syntactically correct and that satisfy (or not) the given requirement, and that can detect many wrong specifications written by humans.

Validating Formal Specifications with LLM-generated Test Cases

TL;DR

This paper tackles validating formal Alloy specifications by automatically generating executable test cases from natural-language requirements using large language models. It empirically evaluates GPT-5 (and other models) with three prompt styles on four Alloy4Fun domain models, measuring syntactic correctness, executability, and ability to detect incorrect specifications. Results show GPT-5 with a few-shot prompt yields high-quality, diverse test suites that are largely syntactically valid and effective at exposing wrong specifications, with non-determinism having limited impact and other models achieving competitive but weaker performance. The work demonstrates practical benefits for validation in formal methods and education, and points to future enhancements such as syntax repair post-processing and extending the approach to temporal or behavioral specifications.

Abstract

Validation is a central activity when developing formal specifications. Similarly to coding, a possible validation technique is to define upfront test cases or scenarios that a future specification should satisfy or not. Unfortunately, specifying such test cases is burdensome and error prone, which could cause users to skip this validation task. This paper reports the results of an empirical evaluation of using pre-trained large language models (LLMs) to automate the generation of test cases from natural language requirements. In particular, we focus on test cases for structural requirements of simple domain models formalized in the Alloy specification language. Our evaluation focuses on the state-of-art GPT-5 model, but results from other closed- and open-source LLMs are also reported. The results show that, in this context, GPT-5 is already quite effective at generating positive (and negative) test cases that are syntactically correct and that satisfy (or not) the given requirement, and that can detect many wrong specifications written by humans.

Paper Structure

This paper contains 18 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Courses domain model in Alloy
  • Figure 2: Example test cases for "Only students can be enrolled in courses"