MutDafny: A Mutation-Based Approach to Assess Dafny Specifications
Isabel Amaral, Alexandra Mendes, José Campos
TL;DR
This work tackles the problem of weak or underspecified Dafny specifications undermining formally verified programs. It introduces MutDafny, a Dafny plugin that performs mutation testing by mutating implementations and checking them against the surrounding specifications, aiming to reveal weaknesses in the specs. The authors compile a comprehensive set of mutation operators, combining 147 operators from traditional languages with 11 Dafny-specific operators synthesized from real bugfix commits, resulting in 32 total operators used in the tool. An extensive empirical study on 794 Dafny programs and 112,810 mutants demonstrates high overall mutation detection in specifications and identifies five real weak specifications through manual analysis, validating the approach and highlighting practical implications for improving spec quality in verification-aware languages. The work also provides a rich dataset and tooling to extend mutation-based specification assessment to Dafny and potentially other verification-oriented languages.
Abstract
This paper explores the use of mutation testing to reveal weaknesses in formal specifications written in Dafny. In verification-aware programming languages, such as Dafny, despite their critical role, specifications are as prone to errors as implementations. Flaws in specs can result in formally verified programs that deviate from the intended behavior. We present MutDafny, a tool that increases the reliability of Dafny specifications by automatically signaling potential weaknesses. Using a mutation testing approach, we introduce faults (mutations) into the code and rely on formal specifications for detecting them. If a program with a mutant verifies, this may indicate a weakness in the specification. We extensively analyze mutation operators from popular tools, identifying the ones applicable to Dafny. In addition, we synthesize new operators tailored for Dafny from bugfix commits in publicly available Dafny projects on GitHub. Drawing from both, we equipped our tool with a total of 32 mutation operators. We evaluate MutDafny's effectiveness and efficiency in a dataset of 794 real-world Dafny programs and we manually analyze a subset of the resulting undetected mutants, identifying five weak real-world specifications (on average, one at every 241 lines of code) that would benefit from strengthening.
