Efficient Mutation Testing via Pre-Trained Language Models

Ahmed Khanfir; Renzo Degiovanni; Mike Papadakis; Yves Le Traon

TL;DR

This paper addresses the limited realism of traditional mutation testing by introducing μBert, a mutation testing tool that uses CodeBERT to generate natural, developer-like mutants: it masks tokens and replaces them with the model's masked-language-model (MLM) predictions. It augments these with additive condition-seeding mutations to capture more complex faults. Evaluated on 689 Defects4J faults, μBert yields tests with up to 17% higher fault-detection potential than PiTest and is more cost-effective. The study also shows that additive mutations substantially improve fault detection and that the two tools are complementary: μBert detects 47 bugs missed by PiTest, while PiTest finds 13 bugs missed by μBert. The approach is implemented with open tooling and evaluated under a rigorous experimental protocol.

Abstract

Mutation testing is an established fault-based testing technique. It operates by seeding faults into the programs under test and asking developers to write tests that reveal these faults. These tests have the potential to reveal a large number of faults -- those that couple with the seeded ones -- and thus are deemed important. To this end, mutation testing should seed faults that are both "natural", in the sense of being easily understood by developers, and strong (having a high chance of revealing faults). To achieve this, we propose using pre-trained generative language models (i.e., CodeBERT) that can produce developer-like code that operates similarly, but not identically, to the target code. This means that the models can seed natural faults, thereby offering opportunities to perform mutation testing. We realise this idea by implementing $\mu$BERT, a mutation testing technique that uses CodeBERT, and empirically evaluate it on 689 faulty program versions. Our results show that the fault revelation ability of $\mu$BERT is higher than that of a state-of-the-art mutation testing tool (PiTest), yielding tests that have up to 17% higher fault detection potential than those of PiTest. Moreover, we observe that $\mu$BERT can complement PiTest: it detects 47 bugs missed by PiTest, while PiTest finds 13 bugs missed by $\mu$BERT.
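
The masking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: `stub_predict` is a hypothetical stand-in for a call to CodeBERT's masked-language-model head, the helper names are invented for illustration, and the compilation check that a real tool would need is omitted. Only the `<mask>` token follows the RoBERTa-style convention that CodeBERT uses.

```python
from typing import Callable, List

MASK = "<mask>"  # RoBERTa-style mask token, as used by CodeBERT


def mask_token(code: str, start: int, end: int) -> str:
    """Replace the token at code[start:end] with the mask token."""
    return code[:start] + MASK + code[end:]


def generate_mutants(
    code: str,
    start: int,
    end: int,
    predict: Callable[[str], List[str]],
) -> List[str]:
    """Build one mutant per predicted token, skipping identical ones."""
    masked = mask_token(code, start, end)
    original_token = code[start:end]
    mutants: List[str] = []
    for candidate in predict(masked):
        if candidate.strip() == original_token.strip():
            continue  # prediction reproduces the original code: not a mutant
        mutant = masked.replace(MASK, candidate, 1)
        if mutant not in mutants:  # drop duplicate mutants
            mutants.append(mutant)
    return mutants


def stub_predict(masked_code: str) -> List[str]:
    """Hypothetical predictor; a real setup would query the language model."""
    return [">", ">=", "<", "!="]


original = "if (a > b) { return a; }"
op_start = original.index(">")
mutants = generate_mutants(original, op_start, op_start + 1, stub_predict)
for m in mutants:
    print(m)
```

With the stub above, the prediction `>` is discarded because it reproduces the original statement, and the remaining three predictions each yield one candidate mutant.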

Paper Structure (24 sections, 6 figures, 5 tables)

This paper contains 24 sections, 6 figures, 5 tables.

Introduction
Background
Mutation Testing
Generative Language Models
Approach
AST Nodes Selection
Token Masking
CodeBERT-MLM prediction
Condition seeding
Using existing conditions in the same class
Using existing variables in the same class
Mutant filtering
Research Questions
Experimental Setup
Dataset & Benchmark
...and 9 more sections

Figures (6)

Figure 1: $\mu$Bert workflow: (1) it parses the Java code given as input and extracts the expressions to mutate; (2) it creates simple-replacement mutants by masking the tokens of interest and invoking CodeBERT; (3) it generates the mutants by replacing the masked token with CodeBERT's predictions; (4) it generates complex mutants via (a) condition seeding, (b) token masking, and then (c) replacement with CodeBERT's predictions; and finally, (5) it discards mutants that do not compile or are syntactically identical to the original code.
Figure 2: Fault-detection performance improvement when using additive patterns. Comparison between $\mu$Bert and $\mu$Bert$_{conv}$, w.r.t. the fault-detection of test suites written to kill all generated mutants.
Figure 3: Fault-detection comparison between $\mu$Bert and $\mu$Bert$_{conv}$ under equal effort: the maximum effort is capped at the smaller of the efforts required to analyse all mutants of each approach, which is that of $\mu$Bert$_{conv}$ in most cases.
Figure 4: Fault-detection comparison between $\mu$Bert and PiTest under equal effort: the maximum effort is capped at the smaller of the efforts required to analyse all mutants of each approach, which is that of Pit-default in most cases.
Figure 5: Comparison between $\mu$Bert and PiTest, relative to the fault-detection of test suites written to kill all generated mutants.
...and 1 more figure
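
Step (4a) of the workflow in Figure 1, condition seeding with existing conditions from the same class, can be sketched as follows. This is an illustrative assumption rather than the tool's actual pattern set: the function names are hypothetical, and the choice of combining conditions with `&&`, `||`, and negation is a guess at what "additive" seeding could look like. The sketch stops before steps (4b) and (4c), in which the seeded code would be masked and completed by CodeBERT.

```python
from typing import List


def seed_conditions(condition: str, class_conditions: List[str]) -> List[str]:
    """Combine a condition with other conditions found in the same class.

    Assumed patterns (not taken from the paper): A && B, A && !B,
    A || B, A || !B.
    """
    seeded: List[str] = []
    for other in class_conditions:
        if other == condition:
            continue  # combining a condition with itself adds nothing
        for op in ("&&", "||"):
            for operand in (f"({other})", f"!({other})"):
                seeded.append(f"({condition}) {op} {operand}")
    return seeded


def apply_to_if(statement: str, old: str, new: str) -> str:
    """Swap the condition of the first matching `if (...)` in the statement."""
    return statement.replace(f"if ({old})", f"if ({new})", 1)


stmt = "if (x > 0) { total += x; }"
variants = seed_conditions("x > 0", ["x > 0", "y == null"])
seeded_mutants = [apply_to_if(stmt, "x > 0", v) for v in variants]
for m in seeded_mutants:
    print(m)
```

Each seeded statement is a candidate that would still need to pass the filtering of step (5) before being kept as a mutant.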
