How Execution Features Relate to Failures: An Empirical Study and Diagnosis Approach

Marius Smytzek, Martin Eberlein, Lars Grunske, Andreas Zeller

TL;DR

Fault localization has often relied on line coverage; this paper argues that a broader set of execution features can improve diagnostic power. It empirically analyzes 17 execution features across 310 bugs from 20 projects and introduces EFDD, which extracts features from executions and trains a decision tree to generate interpretable diagnoses of failures. Data-flow features such as def-use pairs and scalar pairs correlate strongly with failures, with correlation quantified by Spearman's $\rho$, and combining multiple features improves localization. The evaluation reports high predictive accuracy (approximately 89% overall) and practical runtimes, suggesting that interpretable, feature-driven diagnoses can aid developers in debugging and support automated repair workflows.
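
As a rough illustration of the correlation measure named above, the following sketch computes Spearman's $\rho$ between a single execution feature and the test outcome. The data and the names `average_ranks`, `spearman_rho`, `feature_observed`, and `run_failed` are hypothetical and are not taken from the paper; ties are handled with average ranks, and a constant input is treated as uncorrelated, which is an assumption of this sketch.

```python
from math import sqrt

def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant input counts as uncorrelated
    return cov / sqrt(var_x * var_y)

# Hypothetical data: one entry per test run.
feature_observed = [1, 1, 0, 0, 1, 0]  # 1 if the feature occurred in the run
run_failed = [1, 1, 0, 0, 0, 0]        # 1 if the run failed
rho = spearman_rho(feature_observed, run_failed)  # about 0.707
```

A value near 1 means the feature tends to occur in failing runs, a value near 0 means it carries little information about the outcome.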

Abstract

Fault localization is a fundamental aspect of debugging, aiming to identify code regions likely responsible for failures. Traditional techniques primarily correlate statement execution with failures, yet program behavior is influenced by diverse execution features (such as variable values, branch conditions, and definition-use pairs) that can provide richer diagnostic insights. In an empirical study of 310 bugs across 20 projects, we analyzed 17 execution features and assessed their correlation with failure outcomes. Our findings suggest that fault localization benefits from a broader range of execution features: (1) Scalar pairs exhibit the strongest correlation with failures; (2) Beyond line executions, def-use pairs and functions executed are key indicators for fault localization; and (3) Combining multiple features enhances effectiveness compared to relying solely on individual features. Building on these insights, we introduce a debugging approach to diagnose failure circumstances. The approach extracts fine-grained execution features and trains a decision tree to differentiate passing and failing runs. From this model, we derive a diagnosis that pinpoints faulty locations and explains the underlying causes of the failure. Our evaluation demonstrates that the generated diagnoses achieve high predictive accuracy, reinforcing their reliability. These interpretable diagnoses empower developers to efficiently debug software by providing deeper insights into failure causes.
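
To make the idea of training a decision tree on execution features more concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation: the feature names, the six runs, and the greedy Gini-based learner over boolean features are all hypothetical stand-ins chosen for illustration.

```python
def gini(labels):
    """Gini impurity of a list of boolean labels (True = failing run)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels, features):
    """Pick the feature whose true/false split lowers weighted impurity most."""
    best, best_score = None, gini(labels)
    for f in features:
        yes = [l for r, l in zip(rows, labels) if r[f]]
        no = [l for r, l in zip(rows, labels) if not r[f]]
        if not yes or not no:
            continue  # the feature does not separate any runs
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if score < best_score:
            best, best_score = f, score
    return best

def train(rows, labels, features):
    """Return a leaf (True = FAIL, False = PASS) or (feature, yes_tree, no_tree)."""
    f = best_split(rows, labels, features)
    if f is None:
        return sum(labels) * 2 > len(labels)  # majority vote at the leaf
    yes = [(r, l) for r, l in zip(rows, labels) if r[f]]
    no = [(r, l) for r, l in zip(rows, labels) if not r[f]]
    rest = [g for g in features if g != f]
    return (f,
            train([r for r, _ in yes], [l for _, l in yes], rest),
            train([r for r, _ in no], [l for _, l in no], rest))

def predict(tree, row):
    """Walk the tree for one run; True means the run is predicted to fail."""
    while isinstance(tree, tuple):
        f, yes, no = tree
        tree = yes if row[f] else no
    return tree

def failing_paths(tree, path=()):
    """List the feature conditions along every root-to-FAIL path."""
    if not isinstance(tree, tuple):
        return [list(path)] if tree else []
    f, yes, no = tree
    return (failing_paths(yes, path + (f,))
            + failing_paths(no, path + ("not " + f,)))

# Hypothetical feature vectors, one per test run (names are illustrative only).
FEATURES = ["line(6)", "scalar(x > y)"]
rows = [
    {"line(6)": True, "scalar(x > y)": True},
    {"line(6)": True, "scalar(x > y)": True},
    {"line(6)": True, "scalar(x > y)": False},
    {"line(6)": False, "scalar(x > y)": True},
    {"line(6)": False, "scalar(x > y)": False},
    {"line(6)": False, "scalar(x > y)": True},
]
failed = [True, True, False, False, False, False]

tree = train(rows, failed, FEATURES)
diagnosis = failing_paths(tree)  # [["line(6)", "scalar(x > y)"]]
```

Reading the failing paths off the tree yields a conjunction of feature conditions, which is the sense in which such a model can both point at a location and describe the circumstances of the failure. A greedy learner like this one can miss interactions that only show up when two features are considered jointly.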

Paper Structure

This paper contains 31 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Statistical fault localization (Jones et al., 2002). The middle() function takes three values and returns the one that is neither the single smallest nor the single largest one; execution of Line 6 correlates most with the failure.
  • Figure 2: Collecting of Features. For the middle() function, we derive features from the execution. For example, we show the collection of executed lines, branches, definition-use pairs, scalar pairs, and conditions. We show when a feature is collected by assigning it to the corresponding line in the code. The values in parentheses indicate the parameters of a feature; for lines, the line number; for branches, the branch ID; for definition-use pairs, the variable and the lines for the definition and use; for scalar pairs, the two variables, and the applied comparison; for conditions, the actual condition.
  • Figure 3: Suspiciousness. How strongly each feature correlates with the presence of failures, reported across five metrics. Each feature is evaluated according to the best feature of each class for each subject.
  • Figure 4: EFDD at a glance. EFDD takes a program and a set of labeled test cases as input, instruments the program, executes the test cases, and captures an execution trace. From this trace, it constructs execution features to train a decision tree. The resulting model offers an interpretable diagnosis for the observed fault.
  • Figure 5: The decision tree generated by EFDD for the middle() example. Each node represents a decision, leading to a classification of either ✔ (Pass) or ✘ (Fail).
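
The line-coverage baseline described in the Figure 1 caption can be sketched as follows. The body of `middle()` is the commonly cited textbook version of this example, not copied from the paper, and the six test inputs, the helper names `covered_lines` and `tarantula`, and the choice of a Tarantula-style score are assumptions made for illustration. Line numbers are counted from the `def` line, so the function must stay free of blank lines and comments for Line 6 to match.

```python
import sys

def middle(x, y, z):
    if y < z:
        if x < y:
            return y
        elif x < z:
            return y
    else:
        if x > y:
            return y
        elif x > z:
            return x
    return z

def covered_lines(func, *args):
    """Run func(*args); return (result, executed lines relative to the def line)."""
    code = func.__code__
    lines = set()

    def tracer(frame, event, arg):
        if frame.f_code is code and event == "line":
            lines.add(frame.f_lineno - code.co_firstlineno + 1)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, lines

def tarantula(runs):
    """runs: list of (covered_lines, failed). Returns {line: suspiciousness}."""
    total_failed = sum(1 for _, failed in runs if failed)
    total_passed = len(runs) - total_failed
    scores = {}
    for line in set().union(*(lines for lines, _ in runs)):
        failed = sum(1 for lines, f in runs if f and line in lines)
        passed = sum(1 for lines, f in runs if not f and line in lines)
        fail_ratio = failed / total_failed if total_failed else 0.0
        pass_ratio = passed / total_passed if total_passed else 0.0
        total = fail_ratio + pass_ratio
        scores[line] = fail_ratio / total if total else 0.0
    return scores

# Assumed test inputs; the oracle is the true median of the three arguments.
tests = [(3, 3, 5), (1, 2, 3), (3, 2, 1), (5, 5, 5), (5, 3, 4), (2, 1, 3)]
runs = []
for args in tests:
    result, lines = covered_lines(middle, *args)
    runs.append((lines, result != sorted(args)[1]))

scores = tarantula(runs)
most_suspicious = max(scores, key=scores.get)  # line 6
```

With these inputs only `middle(2, 1, 3)` fails, and Line 6 ranks highest even though the passing run `middle(3, 3, 5)` also executes it. Coverage alone therefore cannot separate those two runs, which is the kind of gap that value-based features such as scalar pairs are meant to close.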