MuFF: Stable and Sensitive Post-training Mutation Testing for Deep Learning

Jinhan Kim; Nargiz Humbatova; Gunel Jahangirova; Shin Yoo; Paolo Tonella

TL;DR

MuFF tackles instability in post-training mutation testing for deep learning by introducing an automated stability check and two inhibitor mutation operators (weight and neuron inhibitors), enabling the efficient generation of killable and sensitive mutants. A binary-search framework tunes mutation parameters across multiple original models to maximize mutant quality while maintaining stability. MuFF mutants are 60 percentage points more sensitive than those of the post-training tool DeepMutation++ and 25 percentage points more sensitive than those of the pre-training tool DeepCrime, and MuFF generates mutants 61 times faster than DeepCrime. Spectral analysis reveals that MuFF mutants are internally cohesive yet distinct from DeepCrime mutants, suggesting a unique and valuable mutant space. The approach retains the practicality of post-training mutation while improving reliability and efficiency, with potential benefits for test-quality assessment and fault localization in DL systems.
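The binary-search idea mentioned above can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `smallest_killing_ratio`, the `is_killed` callback, and the assumption that killability grows monotonically with the mutation ratio are all hypothetical and serve only to show how a single ratio parameter could be tuned by bisection.

```python
from typing import Callable, Optional


def smallest_killing_ratio(
    is_killed: Callable[[float], bool],
    lo: float = 0.0,
    hi: float = 1.0,
    tol: float = 1e-3,
) -> Optional[float]:
    """Bisect for the smallest ratio in [lo, hi] whose mutant is killed.

    Hypothetical helper: `is_killed(ratio)` stands in for "generate a mutant
    with this ratio and run the (stability-aware) killing check".
    Assumes monotonicity: if a ratio kills, every larger ratio kills too.
    """
    if not is_killed(hi):
        # Even the strongest mutation is not killed: nothing to return.
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_killed(mid):
            hi = mid  # mid kills: the smallest killing ratio is at or below mid
        else:
            lo = mid  # mid survives: the smallest killing ratio is above mid
    return hi


# Toy stand-in for the killing check: mutants are "killed" from ratio 0.37 upward.
ratio = smallest_killing_ratio(lambda r: r >= 0.37)
```

Searching for the smallest killing ratio is one plausible reading of "killable and sensitive": a mutant that is only just killed is more likely to distinguish strong test sets from weak ones. The actual criterion MuFF uses is defined in the paper's sections on binary search and inhibitor operators.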

Abstract

The rapid adoption of Deep Learning (DL) in a broad range of fields has led to the development of specialised testing techniques for DL systems, including DL mutation testing. However, existing post-training DL mutation techniques often generate mutants that are unstable across multiple training repetitions and multiple applications of the same mutation operator. Additionally, while extremely efficient, they generate mutants without taking into account the mutants' sensitivity and killability, resulting in a large number of ineffective mutants compared to pre-training mutants. In this paper, we present a new efficient post-training DL mutation technique, named MuFF, designed to ensure the stability of the mutants and capable of generating killable and sensitive mutants. MuFF implements an automated stability check and introduces two mutation operators, named weight and neuron inhibitors. Our extensive empirical experiments show that MuFF generates mutants whose sensitivity is 60 and 25 percentage points higher than that of DeepMutation++ and DeepCrime mutants, respectively, while also producing mutants that are more stable than those of DeepMutation++ and different from the mutants of DeepCrime. Moreover, MuFF preserves the benefits of the post-training mutation technique, being 61 times faster than DeepCrime in generating mutants.

Paper Structure (33 sections, 7 equations, 4 figures, 9 tables, 1 algorithm)

Introduction
Background
Post-training Mutations
Pre-training Mutations
Variability and Instability of Post-training Mutations
Research Questions
RQ1 (Variability)
RQ2 (Stability)
Subjects and Configurations
Results
RQ1 (Variability)
RQ2 (Stability)
MuFF
Stable and Sensitive Mutant Generation with Binary Search
Inhibitor Mutation Operators
...and 18 more sections

Figures (4)

Figure 1: This example depicts one (a) or two (b) original DL models, $O_1$ and $O_2$, trained individually using the same source code and hyperparameters. Then, two mutants, $M_1$ and $M_2$, are generated by applying the same post-training mutation operator either to $O_1$ only, or to both $O_1$ and $O_2$.
Figure 2: Minimum number of instances for stable DeepMutation++ mutants. Each bar colour corresponds to a specific MO configuration, i.e., ratio value.
Figure 3: Impact of varying parameter values of WI's ratio on the sensitivity of the generated mutants.
Figure 4: Log-Euclidean distance distribution based on the spectrum analysis.
