An Anatomy of 488 Faults from Defects4J Based on the Control- and Data-Flow Graph Representations of Programs

Alexandra van der Spuy, Bernd Fischer

TL;DR

The paper characterizes real faults directly, rather than through their repairs, by introducing a taxonomy grounded in the flow-graph structure of programs. It defines a combined flow graph G = (V, E_C, E_D) and eight fault classes (six control-flow and two data-flow), then manually labels 488 faults from seven Defects4J projects. Most faults are assigned between one and three classes. The definition fault, a data-flow class, is the most common individual class, yet the majority of faults carry at least one control-flow class. The labeled dataset and taxonomy enable targeted evaluation of fault localization and automated program repair techniques and are extensible to other languages and datasets.
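To make the notation G = (V, E_C, E_D) concrete, the following is a minimal sketch of one way to hold a combined flow graph in memory: a node set V plus separate control-flow and data-flow edge sets. The class name `FlowGraph`, its methods, and the statement-level node labels `s1` to `s4` are assumptions made for illustration; this summary does not state the node granularity or the construction the authors use.

```python
from dataclasses import dataclass, field

Edge = tuple[str, str]  # (source node, target node)


@dataclass
class FlowGraph:
    """Hypothetical container for a combined flow graph G = (V, E_C, E_D)."""
    nodes: set[str] = field(default_factory=set)            # V
    control_edges: set[Edge] = field(default_factory=set)   # E_C
    data_edges: set[Edge] = field(default_factory=set)      # E_D

    def add_control_edge(self, src: str, dst: str) -> None:
        # Control may pass from src directly to dst.
        self.nodes.update((src, dst))
        self.control_edges.add((src, dst))

    def add_data_edge(self, src: str, dst: str) -> None:
        # A value defined at src is used at dst.
        self.nodes.update((src, dst))
        self.data_edges.add((src, dst))


# Toy program (illustrative only):
#   s1: x = read()
#   s2: if x > 0:
#   s3:     y = x
#   s4: done()
g = FlowGraph()
g.add_control_edge("s1", "s2")
g.add_control_edge("s2", "s3")  # branch taken
g.add_control_edge("s2", "s4")  # branch skipped
g.add_control_edge("s3", "s4")
g.add_data_edge("s1", "s2")     # x defined at s1, used in the condition at s2
g.add_data_edge("s1", "s3")     # x defined at s1, used at s3
```

Keeping the two edge sets separate means a control-flow difference and a data-flow difference between two versions of a program can be inspected independently, which matches the split between control-flow and data-flow fault classes described above.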

Abstract

Software fault datasets such as Defects4J provide for each individual fault its location and repair, but do not characterize the faults. Current classifications use the repairs as proxies, but these do not capture the intrinsic nature of the fault. In this paper, we propose a new, direct fault classification scheme based on the control- and data-flow graph representations of programs. Our scheme comprises six control-flow and two data-flow fault classes. We manually apply this scheme to 488 faults from seven projects in the Defects4J dataset. We find that the majority of the faults are assigned between one and three classes. We also find that one of the data-flow fault classes (definition fault) is the most common individual class but that the majority of faults are classified with at least one control-flow fault class. Our proposed classification can be applied to other fault datasets and can be used to improve fault localization and automated program repair techniques for specific fault classes.

Paper Structure

This paper contains 22 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Distribution of the number of fault classes assigned per fault, per project and overall
  • Figure 2: Co-occurrence matrices between classes (our fault classes; repair classes from Defects4JDissection); two classes co-occur for a fault if and only if both classes are assigned to the fault
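
The two figure captions describe quantities that are simple to compute from a labeling: how many classes each fault receives, and how often two classes are assigned to the same fault. The sketch below shows one way to derive both, assuming the labels are available as a mapping from fault ID to a set of class names. The fault IDs and the class names `control_A` and `control_B` are placeholders, not the paper's data or terminology; only `definition` echoes a class named in the abstract. Putting each class's own frequency on the diagonal is also an assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical labeling: fault ID -> set of assigned class names.
labels = {
    "fault-1": {"definition"},
    "fault-2": {"definition", "control_A"},
    "fault-3": {"definition", "control_A", "control_B"},
    "fault-4": {"control_B"},
}


def class_count_distribution(labels):
    """Map 'number of classes assigned' -> 'number of faults with that many'."""
    return Counter(len(classes) for classes in labels.values())


def cooccurrence(labels):
    """Count, for each ordered pair of classes, the faults assigned both.

    Two classes co-occur for a fault if and only if both are assigned to it.
    The diagonal entry (a, a) holds the number of faults assigned class a.
    """
    counts = Counter()
    for classes in labels.values():
        for a in classes:
            counts[(a, a)] += 1
        for a, b in combinations(sorted(classes), 2):
            counts[(a, b)] += 1  # symmetric matrix: record both orders
            counts[(b, a)] += 1
    return counts


distribution = class_count_distribution(labels)
matrix = cooccurrence(labels)
```

With this toy input, `distribution` records two faults with one class, one with two, and one with three, and `matrix[("definition", "control_A")]` is 2 because both classes are assigned to `fault-2` and `fault-3`.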