Validating Formal Specifications with LLM-generated Test Cases
Alcino Cunha, Nuno Macedo
TL;DR
This paper tackles validating formal Alloy specifications by automatically generating executable test cases from natural-language requirements using large language models. It empirically evaluates GPT-5 (and other models) with three prompt styles on four Alloy4Fun domain models, measuring syntactic correctness, executability, and ability to detect incorrect specifications. Results show GPT-5 with a few-shot prompt yields high-quality, diverse test suites that are largely syntactically valid and effective at exposing wrong specifications, with non-determinism having limited impact and other models achieving competitive but weaker performance. The work demonstrates practical benefits for validation in formal methods and education, and points to future enhancements such as syntax repair post-processing and extending the approach to temporal or behavioral specifications.
Abstract
Validation is a central activity when developing formal specifications. Similarly to coding, a possible validation technique is to define upfront test cases or scenarios that a future specification should satisfy or not. Unfortunately, specifying such test cases is burdensome and error prone, which could cause users to skip this validation task. This paper reports the results of an empirical evaluation of using pre-trained large language models (LLMs) to automate the generation of test cases from natural language requirements. In particular, we focus on test cases for structural requirements of simple domain models formalized in the Alloy specification language. Our evaluation focuses on the state-of-art GPT-5 model, but results from other closed- and open-source LLMs are also reported. The results show that, in this context, GPT-5 is already quite effective at generating positive (and negative) test cases that are syntactically correct and that satisfy (or not) the given requirement, and that can detect many wrong specifications written by humans.
