Re-evaluation of Logical Specification in Behavioural Verification
Radoslaw Klimek, Jakub Semczyszyn
TL;DR
The paper tackles the reliability and scalability of automated logical specification methods for behavioural models by performing a structured replication across first-order and temporal logics using a benchmark suite of eight problem classes. It systematically compares Prover9, SPASS, and InKreSAT, revealing that solver performance highly depends on problem structure and distribution of clauses, with InKreSAT delivering near-instant results and outperforming traditional FOL provers by large margins. Key contributions include a curated, reproducible benchmark and empirical insights into how clause-length distributions (including Poisson around $\lambda=3.5$) and liveness/safety ratios affect solver efficiency, along with implications for IDEs and CI/CD pipelines. The work advances empirical software engineering by clarifying when and how automated reasoning can provide real-time feedback in safety-critical verification contexts, and it provides data and scripts for public reuse.
Abstract
This study empirically validates automated logical specification methods for behavioural models, focusing on their robustness, scalability, and reproducibility. By the systematic reproduction and extension of prior results, we confirm key trends, while identifying performance irregularities that suggest the need for adaptive heuristics in automated reasoning. Our findings highlight that theorem provers exhibit varying efficiency across problem structures, with implications for real-time verification in CI/CD pipelines and AI-driven IDEs supporting on-the-fly validation. Addressing these inefficiencies through self-optimising solvers could enhance the stability of automated reasoning, particularly in safety-critical software verification.
