An Empirical Study of Fault Localization in Python Programs

Mohammad Rezaalipour, Carlo A. Furia

TL;DR

This study presents a differentiated replication of fault localization findings from Java research, targeting Python with 135 BugsInPy bugs across 13 projects and employing FauxPy to run spectrum-based (SBFL), mutation-based (MBFL), predicate-switching (PS), and stack-trace (ST) fault localization methods at statement, function, and module granularity. Key results show that SBFL generally outperforms MBFL and the specialized PS/ST approaches in Python, with MBFL providing complementary strengths on certain bug types (notably mutable bugs) at a higher runtime cost. Combining techniques (CombineFL and AvgFL) yields notable improvements in effectiveness, often with moderate overhead. Granularity also plays a significant role, with coarser granularity improving localization performance. Across project categories, data science bugs are harder to localize, while crashes benefit from stack-trace information. Python FL results largely align with Java findings, reinforcing the viability of classical FL methods in Python. The accompanying replication package provides benchmarks and tools for reproducible investigations into Python fault localization and cross-language comparisons.

Abstract

Despite its massive popularity as a programming language, especially in novel domains like data science programs, there is comparatively little research about fault localization that targets Python. Even though it is plausible that several findings about programming languages like C/C++ and Java -- the most common choices for fault localization research -- carry over to other languages, whether the dynamic nature of Python and how the language is used in practice affect the capabilities of classic fault localization approaches remain open questions to investigate. This paper is the first multi-family large-scale empirical study of fault localization on real-world Python programs and faults. Using Zou et al.'s recent large-scale empirical study of fault localization in Java as the basis of our study, we investigated the effectiveness (i.e., localization accuracy), efficiency (i.e., runtime performance), and other features (e.g., different entity granularities) of seven well-known fault-localization techniques in four families (spectrum-based, mutation-based, predicate switching, and stack-trace based) on 135 faults from 13 open-source Python projects from the BugsInPy curated collection. The results replicate for Python several results known about Java, and shed light on whether Python's peculiarities affect the capabilities of fault localization. The replication package that accompanies this paper includes detailed data about our experiments, as well as the tool FauxPy that we implemented to conduct the study.

Paper Structure (74 sections, 5 equations, 11 figures, 14 tables)

Figures (11)

  • Figure 1: SBFL formulas to compute the suspiciousness score of an entity $e$ given tests $T = P \cup F$ partitioned into passing $P$ and failing $F$. All formulas compute a score that is higher the more failing tests $F^+(e)$ cover $e$, and lower the more passing tests $P^+(e)$ cover $e$---capturing the basic heuristics of SBFL.
  • Figure 2: MBFL formulas to compute the suspiciousness score of a mutant $m$ given tests $T = P \cup F$ partitioned into passing $P$ and failing $F$. All formulas compute a score that is higher the more failing tests $F^k(m)$ kill $m$, and lower the more passing tests $P^k(m)$ kill $m$---capturing the basic heuristics of mutation analysis. On the right, MBFL formulas to compute the suspiciousness score of a program entity $e$ by aggregating the suspiciousness score of all mutants $m \in M$ that modified $e$ in the original program.
  • Figure 3: An example of program edit, and the corresponding ground truth faulty locations.
  • Figure 4: Classification of the BugsInPy bugs used in our experiments into three categories.
  • Figure 5: Definitions of common FL effectiveness metrics. The top row shows two variants $\mathcal{I}$, $\widetilde{\mathcal{I}}$ of the $E_{\mathrm{inspect}}$ metric, and the exam score $\mathcal{E}$, for a generic bug $b$ and fault localization technique $L$. The bottom row shows cumulative metrics for a set $B$ of bugs: the "at $n$" metric $L@_Bn$, and the average $\widetilde{\mathcal{I}}$ and $\mathcal{E}$ metrics.
  • ...and 6 more figures
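
The SBFL heuristic described in the caption of Figure 1 (a higher score the more failing tests cover an entity, a lower score the more passing tests cover it) can be illustrated with a minimal sketch. The code below uses the standard textbook definitions of the Ochiai and Tarantula formulas; it is not taken from FauxPy, and the function names, the `coverage` dictionary layout, and the entity labels are assumptions made for illustration only.

```python
import math


def ochiai(failed_cov, passed_cov, total_failed):
    """Textbook Ochiai: |F+(e)| / sqrt(|F| * (|F+(e)| + |P+(e)|))."""
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0


def tarantula(failed_cov, passed_cov, total_failed, total_passed):
    """Textbook Tarantula: failing ratio over (failing ratio + passing ratio)."""
    fail_ratio = failed_cov / total_failed if total_failed else 0.0
    pass_ratio = passed_cov / total_passed if total_passed else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0


def rank_entities(coverage, failing, formula="ochiai"):
    """Rank entities by suspiciousness, most suspicious first.

    coverage: hypothetical layout mapping each test name to the set of
              entities (e.g. statements) that the test covers.
    failing:  set of names of the failing tests F; all others are in P.
    """
    entities = set().union(*coverage.values())
    total_failed = sum(1 for test in coverage if test in failing)
    total_passed = len(coverage) - total_failed
    scores = {}
    for entity in entities:
        # |F+(e)| and |P+(e)|: failing and passing tests that cover the entity
        failed_cov = sum(1 for test, cov in coverage.items()
                         if test in failing and entity in cov)
        passed_cov = sum(1 for test, cov in coverage.items()
                         if test not in failing and entity in cov)
        if formula == "ochiai":
            scores[entity] = ochiai(failed_cov, passed_cov, total_failed)
        else:
            scores[entity] = tarantula(failed_cov, passed_cov,
                                       total_failed, total_passed)
    # Sort by descending score; break ties by entity name for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Illustrative data only: statement "b" is covered by the single failing test
    example_coverage = {"t1": {"a", "b"}, "t2": {"a"}, "t3": {"b"}}
    for entity, score in rank_entities(example_coverage, failing={"t3"}):
        print(entity, round(score, 3))
```

In this toy input, the entity covered by the failing test ranks above the one covered only by passing tests, which is the behavior the caption describes. The same ranking function could be applied at function or module granularity by changing what an entity denotes.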