An Empirical Study of the Realism of Mutants in Deep Learning

Zaheed Ahmed, Philip Makedonski, Jens Grabowski

TL;DR

The paper tackles whether DL mutants realistically mirror real faults and compares pre-training versus post-training mutation approaches. It introduces a statistical framework using Coupling Strength and IoU to quantify realism across four public fault datasets, revealing that pre-training mutants generally exhibit stronger coupling and higher behavioral similarity to real faults, albeit with substantial retraining costs. The study curated 86 reproducible real faults, applied fault-informed pre-training and operator-based post-training mutations, and demonstrated that realism advantages of pre-training persist across datasets, though not universally. The findings underscore the need for efficient, high-realism mutation operators and advocate multi-dimensional fault-detection criteria to improve the credibility of mutation-based DL evaluation and debugging workflows.

Abstract

Mutation analysis is a well-established technique for assessing test quality in the traditional software development paradigm by injecting artificial faults into programs. Its application to deep learning (DL) has expanded beyond classical testing to support tasks such as fault localization, repair, data generation, and model robustness evaluation. The core assumption is that mutants behave similarly to real faults, an assumption well established in traditional software systems but largely unverified for DL. This study presents the first empirical comparison of pre-training and post-training mutation approaches in DL with respect to realism. We introduce a statistical framework to quantify their coupling strength and behavioral similarity to real faults using publicly available bug datasets: CleanML, DeepFD, DeepLocalize, and defect4ML. Mutants are generated using state-of-the-art tools representing both approaches. Results show that pre-training mutants exhibit consistently stronger coupling and higher behavioral similarity to real faults than post-training mutants, indicating greater realism. However, the substantial computational cost of pre-training mutation underscores the need for more effective post-training operators that match or exceed the realism demonstrated by pre-training mutants.
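
The two realism measures named above, coupling strength and IoU, are defined formally in the paper's equations, which are not reproduced on this page. As a rough illustration only, the sketch below assumes set-based definitions: IoU as the intersection over union of the inputs that kill a mutant and the inputs that expose the real fault, and coupling strength as the share of mutant-killing inputs that also expose the real fault. The function names, the input IDs, and both definitions are assumptions made for this example and may differ from the paper's.

```python
# Illustrative sketch only: these set-based definitions are assumptions,
# not the paper's equations.

def iou(mutant_kills: set, fault_detects: set) -> float:
    """Intersection over union of two sets of test-input IDs."""
    union = mutant_kills | fault_detects
    if not union:
        return 0.0  # neither the mutant nor the fault is detected by any input
    return len(mutant_kills & fault_detects) / len(union)


def assumed_coupling_strength(mutant_kills: set, fault_detects: set) -> float:
    """Assumed definition: fraction of mutant-killing inputs that also
    expose the real fault."""
    if not mutant_kills:
        return 0.0  # a mutant that is never killed couples to nothing
    return len(mutant_kills & fault_detects) / len(mutant_kills)


# Hypothetical input IDs, chosen only to make the arithmetic easy to follow.
mutant_kills = {1, 2, 3, 4}   # inputs on which the mutant misbehaves
fault_detects = {3, 4, 5}     # inputs on which the real fault shows up

iou_value = iou(mutant_kills, fault_detects)                       # 2 / 5
cs_value = assumed_coupling_strength(mutant_kills, fault_detects)  # 2 / 4
print(f"IoU = {iou_value:.2f}, coupling strength = {cs_value:.2f}")
```

Under these assumed definitions, a mutant whose killing inputs largely coincide with the fault-exposing inputs scores high on both measures, which matches the intuition behind calling it realistic.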

Paper Structure

This paper contains 47 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Mutant generation with pre-training and post-training approaches
  • Figure 2: An overview of the proposed methodology. KI, KP, CS, and IoU correspond to the definitions provided in the paper's equations, from the killing-input definition through the IoU definition.
  • Figure 3: Experimental pipeline for evaluating mutant realism
  • Figure 4: Bug-wise coupling strength distributions across datasets. Each bug shows three scenarios: pre-training, post-training scenario 1, and post-training scenario 2. The color coding and legend shown in subfigure (a) apply consistently to all subfigures (b)--(d).
  • Figure 5: Dataset-level summary of bug-wise cases achieving the highest median coupling strength.
  • ...and 2 more figures