BayesFLo: Bayesian fault localization of complex software systems

Yi Ji; Simon Mak; Ryan Lekivetz; Joseph Morgan

TL;DR

BayesFLo is a Bayesian fault localization method for complex software. It places a model over all input combinations that could be root causes and integrates domain knowledge through a product-form prior that encodes combination hierarchy and heredity. It partitions combinations into tested-and-passed (TP), tested-and-failed (TF), and untested (UT) sets, and derives a tractable posterior for TF combinations using an alternate formulation linked to minimal set covers, solved via integer linear programming (ILP) and inclusion–exclusion. The approach yields probabilistic risk assessments, disentangles tied suspicious combinations, and reduces debugging costs; in the TCAS and JMP Easy DOE case studies, the true root causes are identified with high posterior probability. Compared with deterministic covering-array analyses, this probabilistic, knowledge-guided framework supports more confident and cost-effective software fault diagnosis.
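The partition of combinations by test outcome can be illustrated with a small sketch. It assumes that a test fails exactly when it contains at least one root cause, so any combination that appears in a passed test is ruled out. The function names, the `(factor indices, levels)` encoding, the `max_order` cutoff, and the toy three-factor design are all hypothetical choices for illustration, not taken from the paper.

```python
from itertools import combinations, product

def combos_in_test(test, max_order):
    """All (factor indices, levels) combinations of order <= max_order in one test case."""
    covered = set()
    for k in range(1, max_order + 1):
        for idx in combinations(range(len(test)), k):
            covered.add((idx, tuple(test[i] for i in idx)))
    return covered

def partition_combinations(tests, outcomes, levels, max_order=2):
    """Split all combinations into tested-and-passed, tested-and-failed, and untested."""
    passed, failed = set(), set()
    for test, y in zip(tests, outcomes):
        # Outcome 1 marks a failed test case, 0 a passed one.
        (failed if y == 1 else passed).update(combos_in_test(test, max_order))
    # Enumerate every combination up to max_order over the full factor space.
    everything = set()
    for k in range(1, max_order + 1):
        for idx in combinations(range(len(levels)), k):
            for lv in product(*(levels[i] for i in idx)):
                everything.add((idx, lv))
    c_tp = passed                        # seen in at least one passed test
    c_tf = failed - passed               # seen only in failed tests
    c_ut = everything - passed - failed  # never covered by any test
    return c_tp, c_tf, c_ut

# Toy design (assumed): three binary factors, four runs, one failure.
levels = [(0, 1), (0, 1), (0, 1)]
tests = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
outcomes = [0, 1, 0, 0]
c_tp, c_tf, c_ut = partition_combinations(tests, outcomes, levels, max_order=2)
print(sorted(c_tf))
```

In this toy design every single-input setting also appears in a passed run, so only three two-input combinations remain in the TF set. They are tied from the test data alone, which is the situation a probabilistic ranking is meant to resolve.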

Abstract

Software testing is essential for the reliable development of complex software systems. A key step in software testing is fault localization, which uses test data to pinpoint failure-inducing combinations for further diagnosis. Existing fault localization methods have two key limitations: they (i) do not incorporate domain and/or structural knowledge from test engineers, and (ii) do not provide a probabilistic assessment of risk for potential root causes. Such methods can thus fail to confidently whittle down the combinatorial number of potential root causes in complex systems, resulting in prohibitively high testing costs. To address this, we propose a novel Bayesian fault localization framework called BayesFLo, which leverages a flexible Bayesian model for identifying potential root causes with probabilistic uncertainty. Using a carefully specified prior on root cause probabilities, BayesFLo permits the integration of domain and structural knowledge via the principles of combination hierarchy and heredity, which capture the expected structure of failure-inducing combinations. We then develop new algorithms for efficient computation of posterior root cause probabilities, leveraging recent tools from integer programming and graph representations. Finally, we demonstrate the effectiveness of BayesFLo over existing methods in two fault localization case studies on the Traffic Alert and Collision Avoidance System and the JMP Easy DOE platform.
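To make combination hierarchy and heredity concrete, the sketch below shows one illustrative product-form prior. It is an assumption for exposition and not necessarily the paper's exact specification: the prior root cause probability of a multi-input combination is taken as the product of the probabilities assigned to its single-input components. The function name, the `single_prior` dictionary, and the `default` value are hypothetical.

```python
from math import prod

def prior_root_cause_prob(combo, single_prior, default=0.05):
    """Illustrative product-form prior for one combination.

    combo: (factor indices, levels), e.g. ((0, 2), (1, 0))
    single_prior: dict mapping (factor index, level) -> prior probability,
        where a test engineer can raise entries for inputs believed risky
    default: assumed prior for single inputs with no specific knowledge
    """
    idx, lv = combo
    return prod(single_prior.get((i, l), default) for i, l in zip(idx, lv))

# Domain knowledge (assumed): level 1 of factor 0 is considered more fault-prone.
single_prior = {(0, 1): 0.2}

p_single = prior_root_cause_prob(((0,), (1,)), single_prior)
p_pair_risky = prior_root_cause_prob(((0, 1), (1, 0)), single_prior)
p_pair_plain = prior_root_cause_prob(((1, 2), (0, 0)), single_prior)
print(p_single, p_pair_risky, p_pair_plain)
```

Under this choice, a higher-order combination never has a larger prior probability than any of its components (hierarchy), and a combination built from more suspect inputs receives a larger prior probability than one built from unremarkable inputs (heredity).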

Paper Structure

This paper contains 18 sections, 2 theorems, 24 equations, 12 figures, 7 tables, 1 algorithm.

Introduction
Motivating Application: Fault Localization of TCAS
Background & Challenges
State-of-the-Art and Its Limitations
The BayesFLo Model
Prior Specification
Posterior Root Cause Probabilities
Computation of Root Cause Probabilities
An Alternate Formulation
Enumerating Minimal Covers
Computing Root Cause Probabilities
Algorithm Summary
Case Studies
Case Study 1: Traffic Alert and Collision Avoidance System
Case Study 2: JMP Easy DOE Platform
...and 3 more sections

Key Result

Proposition 1

Let $(\mathbf{i},\mathbf{j}) \in \mathcal{C}_{\rm TF}$, and let $\mathcal{M}_{(\mathbf{i},\mathbf{j})}$ be the index set of failed test cases for which $(\mathbf{i},\mathbf{j})$ is a potential root cause. Define the event that all failures in $\mathcal{M}_{(\mathbf{i},\mathbf{j})}$ can be explained by the selected root causes $\{c \in \mathcal{C}_{\rm TF}: Z_c = 1\}$. The desired posterior root cause prob…
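The idea that every failure must be explained by at least one selected root cause can be checked by brute force on a tiny example. The sketch below is not the paper's algorithm, which relies on minimal set covers and integer programming. It simply enumerates every 0/1 assignment of the indicators $Z_c$ over the TF combinations, keeps the assignments that explain all failed tests, and normalizes. It assumes independent priors on the $Z_c$ and that a test fails exactly when it contains a root cause; the function name and the labels `A`, `B`, `C` are hypothetical.

```python
from itertools import product

def posterior_root_cause_probs(tf_combos, prior, failed_cover):
    """Brute-force posterior P(Z_c = 1 | data) for each TF combination.

    tf_combos: list of combination labels
    prior: dict label -> prior root cause probability (assumed independent)
    failed_cover: list of sets, one per failed test, holding the TF
        combinations contained in that test
    """
    evidence = 0.0
    joint = {c: 0.0 for c in tf_combos}
    for z in product([0, 1], repeat=len(tf_combos)):
        active = {c for c, zc in zip(tf_combos, z) if zc}
        # Keep only assignments in which every failed test is explained.
        if not all(active & cover for cover in failed_cover):
            continue
        weight = 1.0
        for c, zc in zip(tf_combos, z):
            weight *= prior[c] if zc else 1.0 - prior[c]
        evidence += weight
        for c in active:
            joint[c] += weight
    if evidence == 0.0:
        raise ValueError("no assignment of root causes explains all failures")
    return {c: joint[c] / evidence for c in tf_combos}

# One failed test containing three tied combinations; A has a larger prior.
tf_combos = ["A", "B", "C"]
prior = {"A": 0.05, "B": 0.01, "C": 0.01}
failed_cover = [{"A", "B", "C"}]
post = posterior_root_cause_probs(tf_combos, prior, failed_cover)
print(post)
```

The enumeration visits $2^{|\mathcal{C}_{\rm TF}|}$ assignments, so it is only usable for a handful of combinations. That cost is the motivation for a formulation based on minimal covers rather than full enumeration.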

Figures (12)

Figure 1: Visualizing the protection volume module in the TCAS software system TCASFAA.
Figure 1: Input factors and their corresponding levels for our motivating TCAS case study.
Figure 2: The $M=17$-run CA design and test outcomes for our motivating TCAS case study. Here, an outcome of 0 indicates a passed test case, with 1 indicating a failed test case.
Figure 2: Top suspicious combinations from the JMP Covering Array analysis for our motivating TCAS case study, with its corresponding failure counts from test runs.
Figure 3: [Left] Visualizing the use of passed and failed test cases for partitioning the set of considered combinations $\mathcal{C}$ into $\mathcal{C}_{\rm TP}$, $\mathcal{C}_{\rm TF}$ and $\mathcal{C}_{\rm UT}$. [Right] Workflow for the proposed BayesFLo fault localization approach.
...and 7 more figures

Theorems & Definitions (2)

Proposition 1
Proposition 2
