An Empirical Evaluation of Manually Created Equivalent Mutants

Philipp Straubinger, Alexander Degenhart, Gordon Fraser

TL;DR

This work investigates how often humans create equivalent mutants in mutation testing, and how well they recognize them, using the Code Defenders game. It evaluates two automated equivalence detectors (TCE and TCE+) on manually created mutants across ten Java classes. The study finds that about 5.75% of player-created mutants are equivalent (roughly 7% per class). TCE+ detects around 41.5% of all equivalent mutants, and TCE up to 16.7%. Players, however, often misjudge equivalence in duels: only about 35% of equivalent cases are correctly identified. The results highlight a substantial gap between automated detection and human judgment, underscoring the need for improved teaching and tooling for mutation testing and equivalence reasoning. The authors also provide a publicly available dataset to support further research.

Abstract

Mutation testing consists of evaluating how effective test suites are at detecting artificially seeded defects in the source code, and guiding the improvement of the test suites. Although mutation testing tools are increasingly adopted in practice, equivalent mutants, i.e., mutants that differ only in syntax but not semantics, hamper this process. While prior research investigated how frequently equivalent mutants are produced by mutation testing tools and how effective existing methods of detecting these equivalent mutants are, it remains unclear to what degree humans also create equivalent mutants, and how well they perform at identifying them. We therefore study these questions in the context of Code Defenders, a mutation testing game in which players competitively produce mutants and tests. Using manual inspection as well as automated identification methods, we establish that fewer than 10% of manually created mutants are equivalent. Surprisingly, our findings indicate that a significant portion of developers struggle to accurately identify equivalent mutants, emphasizing the need for improved detection mechanisms and developer training in mutation testing.
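
To make the notion of an equivalent mutant concrete, the following is a minimal illustrative sketch in Java. It is not taken from the paper or from Code Defenders; the class `EquivalentMutantSketch` and its methods are hypothetical names chosen for this example. It contrasts a mutant that no test can kill (equivalent) with one that a simple test does kill.

```java
// Hypothetical example, not from the paper: one original method and two mutants.
class EquivalentMutantSketch {

    // Original: sums all elements of the array.
    static int sumOriginal(int[] values) {
        int sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    // Equivalent mutant: `<` replaced by `!=`.
    // i starts at 0 and increases by exactly 1, and values.length is never
    // negative, so i reaches values.length exactly. Both loop conditions
    // therefore stop at the same iteration for every input.
    static int sumEquivalentMutant(int[] values) {
        int sum = 0;
        for (int i = 0; i != values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    // Killable (non-equivalent) mutant: the loop starts at 1 instead of 0,
    // so the first element is skipped. Any test whose input has a non-zero
    // first element observes a different result.
    static int sumKillableMutant(int[] values) {
        int sum = 0;
        for (int i = 1; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3};
        System.out.println("original:          " + sumOriginal(input));
        System.out.println("equivalent mutant: " + sumEquivalentMutant(input));
        System.out.println("killable mutant:   " + sumKillableMutant(input));
    }
}
```

On the input `{1, 2, 3}` the original and the equivalent mutant both return 6, while the killable mutant returns 5. No test input can separate the first two, which is why such mutants can only be identified by reasoning about the code rather than by writing more tests.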

Paper Structure (25 sections, 3 figures, 8 tables)

Figures (3)

  • Figure 1: Attacker view of Code Defenders
  • Figure 2: Equivalence duel during a game of Code Defenders
  • Figure 3: Equivalent Mutant Detection Ratios for the TCE and TCE+ techniques