An Empirical Study of Fault Localisation Techniques for Deep Learning

Nargiz Humbatova, Jinhan Kim, Gunel Jahangirova, Shin Yoo, Paolo Tonella

TL;DR

This study empirically evaluates four state-of-the-art fault localisation tools for deep neural networks on a benchmark comprising real and artificially mutated faults. A neutrality analysis is introduced to extend ground truth with alternative, equivalent patches, revealing that single-ground-truth evaluations significantly underreport FL performance. DeepFD generally offers the strongest localisation signals but at higher runtime costs, while Neuralint provides faster, static analysis-based results with competitive accuracy. Across both original and neutrality-augmented ground truths, FL remains challenging, underscoring the need for broader ground truths and more actionable localisation signals to support DL debugging in practice.

Abstract

As Deep Neural Networks (DNNs) grow in popularity, so does the need for tools that assist developers in implementing, testing and debugging them. Several approaches have been proposed that automatically analyse and localise potential faults in DNNs under test. In this work, we evaluate and compare existing state-of-the-art fault localisation techniques, which rely on dynamic or static analysis of the DNN. The evaluation is performed on a benchmark consisting of both real faults obtained from bug reporting platforms and faulty models produced by a mutation tool. Our findings indicate that evaluating DNN fault localisation tools against a single, specific ground truth (e.g., the human-defined one) yields low performance (maximum average recall of 0.31 and precision of 0.23). However, these figures increase when alternative, equivalent patches that exist for a given faulty DNN are taken into account. Results indicate that DeepFD is the most effective tool, achieving an average recall of 0.61 and precision of 0.41 on our benchmark.
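To make the effect of alternative ground truths concrete, the following is a minimal sketch of how precision, recall and an $F_\beta$ score (with $\beta = 3$, matching the $F_3$ score named in the Figure 2 caption) could be computed for a tool's output. It is not the authors' evaluation code. The fault labels, the function names, and the choice to keep the best-scoring ground truth among the alternatives are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple


def precision_recall(reported: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Precision and recall of reported fault types against one ground truth."""
    hits = len(reported & truth)
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


def best_score(
    reported: Set[str], ground_truths: Iterable[Set[str]], beta: float = 3.0
) -> Tuple[float, float, float]:
    """Assumed aggregation: score against each ground truth, keep the best F-beta."""
    scores = []
    for truth in ground_truths:
        p, r = precision_recall(reported, truth)
        scores.append((p, r, f_beta(p, r, beta)))
    return max(scores, key=lambda s: s[2], default=(0.0, 0.0, 0.0))


# Hypothetical fault labels, not taken from the benchmark.
reported = {"learning_rate", "optimizer"}
original_truth = {"loss"}
alternative_truth = {"learning_rate"}

single = best_score(reported, [original_truth])
extended = best_score(reported, [original_truth, alternative_truth])
print(single, extended)
```

In this toy case the tool scores zero against the single human-defined ground truth, yet it reports a fault type that matches an equivalent alternative patch, so the extended evaluation credits it with non-zero precision and recall.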

Paper Structure

This paper contains 20 sections, 3 equations, 2 figures, 12 tables, 1 algorithm.

Figures (2)

  • Figure 1: An example neutrality network of D4
  • Figure 2: Average execution time and average performance ($F_3$ score) for each tool