Real Faults in Deep Learning Fault Benchmarks: How Real Are They?
Gunel Jahangirova, Nargiz Humbatova, Jinhan Kim, Shin Yoo, Paolo Tonella
TL;DR
This study empirically examines the benchmarks of real faults used to evaluate software engineering approaches for Deep Learning (DL) systems. By manually analyzing 490 faults across five benchmarks, the authors define four realism criteria and assess fault-source correspondence, training-data realism, fault-type distribution, and reproducibility; they then evaluate the extent to which these faults can be reproduced and simulated by mutation tools. They find that only 58 faults (18.5% of the 314 faults eligible for the study) meet all realism conditions, that attempts to reproduce these faults succeed in just over half of the cases (52%), and that only a minority of faults align with existing mutation operators. The work highlights significant limitations of current DL fault benchmarks in representativeness, maintenance, and independence, and argues for stricter mining practices, independent benchmark creation, and closer integration with mutation testing to support robust evaluation of DL testing techniques.
Abstract
As the adoption of Deep Learning (DL) systems continues to rise, an increasing number of approaches are being proposed to test these systems, localise faults within them, and repair those faults. The best attestation of effectiveness for such techniques is an evaluation that showcases their capability to detect, localise and fix real faults. To facilitate these evaluations, the research community has collected multiple benchmarks of real faults in DL systems. In this work, we perform a manual analysis of 490 faults from five different benchmarks and identify that 314 of them are eligible for our study. Our investigation focuses specifically on how well the bugs correspond to the sources they were extracted from, which fault types are represented, and whether the bugs are reproducible. Our findings indicate that only 18.5% of the faults satisfy our realism conditions. Our attempts to reproduce these faults were successful only in 52% of cases.
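The headline figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 18.5% realism rate is computed over the 314 eligible faults, which is consistent with the 58 faults quoted in the TL;DR. The variable names are illustrative and do not come from the paper.

```python
# Counts quoted in the TL;DR and abstract.
total_faults = 490      # faults manually analysed across the five benchmarks
eligible_faults = 314   # faults eligible for the study
realistic_faults = 58   # faults meeting all realism conditions

# Assumption: the realism rate uses the eligible faults as its denominator.
eligible_share = eligible_faults / total_faults
realism_rate = realistic_faults / eligible_faults

print(f"eligible share of analysed faults: {eligible_share:.1%}")
print(f"realistic share of eligible faults: {realism_rate:.1%}")
```

Under that assumption, the second ratio rounds to the 18.5% reported in the abstract.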
