Efficient Black-Box Fault Localization for System-Level Test Code Using Large Language Models

Ahmadreza Saboor Yaraghi, Golnaz Gharachorlu, Sakina Fatima, Lionel C. Briand, Ruiyuan Wan, Ruifeng Gao

TL;DR

This paper introduces a fully static, LLM-powered approach to system-level test code fault localization (TCFL) that operates without executing tests. It jointly tackles execution-trace estimation from a single failure log and prompt-based ranking of faulty locations, employing three trace-estimation algorithms (fill-in-the-gaps, CFG-based pruning, and call-site refinement) and a targeted prompt template to reduce input size and search space. Evaluations on an industrial Python dataset show that the estimated traces closely approximate real traces (F1 around 90%), while pruning reduces LLM inference time by up to 34% with minimal impact on fault localization accuracy. Compared to state-of-the-art baselines adapted for TCFL, the proposed method delivers equal or better FL performance with substantially improved scalability and efficiency, particularly at the block level. The work demonstrates the practicality of execution-free TCFL and highlights promising directions for extending trace estimation, prompt design, and cross-language applicability.

Abstract

Fault localization (FL) is a critical step in debugging, which typically relies on repeated executions to pinpoint faulty code regions. However, repeated executions can be impractical in the presence of non-deterministic failures or high execution costs. While recent efforts have leveraged Large Language Models (LLMs) to aid execution-free FL, these have primarily focused on identifying faults in the system-under-test (SUT) rather than in the often complex system-level test code. Yet the latter is also important, since in practice many failures are triggered by faulty test code. To overcome these challenges, we introduce a fully static, LLM-driven approach for system-level test code fault localization (TCFL) that does not require executing the test case. Our method uses a single failure execution log to estimate the test's execution trace through three novel algorithms that identify only the code statements likely involved in the failure. This pruned trace, combined with the error message, is used to prompt the LLM to rank potential faulty locations. Our black-box, system-level approach requires no access to the SUT source code and is applicable to complex test scripts that assess full system behavior. We evaluate our technique at the function, block, and line levels using an industrial dataset of faulty test cases that were not used in pre-training LLMs. Results show that our best-estimated traces closely match the actual traces, with an F1 score of around 90%. Additionally, pruning the complex system-level test code reduces the LLM's inference time by up to 34% without any loss in FL performance. Our method achieves equal or higher FL accuracy while requiring over 85% less average inference time per test case and 93% fewer tokens than the latest LLM-guided FL method.
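
To make the idea of estimating a trace from a single failure log more concrete, the following is a minimal Python sketch of one plausible first step: matching the lines of a failure log against the static log statements found in the test code. It is an illustration only and not the paper's algorithms. It assumes that log statements are plain `print` calls with a literal string message and that each message appears verbatim on its own log line; the names `match_log`, `SAMPLE_SOURCE`, and `SAMPLE_LOG` are hypothetical.

```python
import ast

# Hypothetical test script; it is only parsed, never executed.
SAMPLE_SOURCE = """\
def test_login():
    print("connecting")
    session = connect()
    if session.ok:
        print("logged in")
    else:
        print("login failed")
    check(session)
"""

# Hypothetical failure log produced by one failing run of the script.
SAMPLE_LOG = """\
connecting
login failed
AssertionError: session check failed
"""


def static_log_statements(source):
    """Map line number -> message for every print("<literal>") call."""
    messages = {}
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            messages[node.lineno] = node.args[0].value
    return messages


def match_log(source, log):
    """Split static log statements into those seen in the log and those not.

    Returns (observed, unobserved) as sorted lists of source line numbers.
    Observed statements are evidence that the enclosing code ran; unobserved
    ones hint that their branch was probably not taken.
    """
    log_lines = {line.strip() for line in log.splitlines()}
    statements = static_log_statements(source)
    observed = sorted(n for n, msg in statements.items() if msg in log_lines)
    unobserved = sorted(n for n, msg in statements.items() if msg not in log_lines)
    return observed, unobserved


if __name__ == "__main__":
    observed, unobserved = match_log(SAMPLE_SOURCE, SAMPLE_LOG)
    print("observed log statements at lines:", observed)
    print("unobserved log statements at lines:", unobserved)
```

In this toy example the `else` branch is marked as observed and the `if` branch as unobserved, which is the kind of evidence a pruning step could use to drop code that is unlikely to be involved in the failure before prompting the LLM. Handling formatted messages, logging frameworks, and control flow between log statements is left out of this sketch.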

Paper Structure

This paper contains 42 sections, 4 equations, 5 figures, 10 tables, 4 algorithms.

Figures (5)

  • Figure 1: Two faulty test statements in the CPython project, causing assertion failure at commit https://github.com/python/cpython/blob/7628f67d55cb65bad9c9266e0457e468cd7e3775/Lib/test/test_math.py/#L815, along with their fixed variants.
  • Figure 2: An illustration of black-box system-level testing, where a fault in the test script leads to a failure observed in the test execution log. The test script interacts with the SUT without having direct access to its source code.
  • Figure 3: A faulty test code and its corresponding execution log.
  • Figure 4: Our prompt template for test code fault localization. Text enclosed in curly braces ({ }) represents variable placeholders dynamically filled during the prompting process.
  • Figure 5: Examples of the requested output format for function, block, and line-level fault localization.

Theorems & Definitions (5)

  • Definition 1: Basic Block of Code
  • Definition 2: Ranking-Based Fault Localization
  • Definition 3: Execution Trace
  • Definition 4: Per-Function Control Flow Graph
  • Definition 5: Static Log Statement