TestWeaver: Execution-aware, Feedback-driven Regression Testing Generation with Large Language Models

Cuong Chi Le, Cuong Duc Van, Tung Duy Vu, Thai Minh Pham Vu, Hoang Nhat Phan, Huy Nhat Phan, Tien N. Nguyen

TL;DR

TestWeaver tackles the coverage plateau in LLM-driven regression test generation by integrating lightweight program analysis into the prompting loop. It combines backward dynamic slicing to produce focused code slices, a closest-test retrieval strategy to ground the LLM in relevant execution context, and in-line execution annotations to reveal runtime states. Empirically, TestWeaver achieves higher line (68%) and branch (54%) coverage on 35 Python projects in the CM suite than state-of-the-art baselines, while also reducing token costs and accelerating convergence to peak coverage (about 2.76× faster). The results validate a new direction that blends static/dynamic program analysis with LLMs to improve test generation efficiency, scalability, and effectiveness, with implications for broader program-analysis–assisted AI tooling.
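To make the idea of a backward dynamic slice concrete, here is a minimal sketch over a recorded execution trace. It is an illustration under stated assumptions, not TestWeaver's implementation: the `Step` record and the `backward_slice` function are hypothetical names, and only data dependences are followed (control dependences are ignored for brevity).

```python
from typing import List, NamedTuple, Set


class Step(NamedTuple):
    """One executed statement in a trace (hypothetical record format)."""
    line: int        # source line number that was executed
    defs: Set[str]   # variables written by this step
    uses: Set[str]   # variables read by this step


def backward_slice(trace: List[Step], target_index: int) -> List[int]:
    """Return the lines the target step transitively data-depends on.

    Walks the trace backwards from the target, keeping every step that
    defines a variable still needed, then swapping that variable for the
    ones the step itself reads.
    """
    target = trace[target_index]
    wanted = set(target.uses)
    kept = {target.line}
    for step in reversed(trace[:target_index]):
        if step.defs & wanted:
            kept.add(step.line)
            # The most recent definition satisfies the need; earlier
            # definitions of the same variable are no longer relevant.
            wanted = (wanted - step.defs) | step.uses
    return sorted(kept)


trace = [
    Step(1, {"a"}, set()),
    Step(2, {"b"}, set()),   # unrelated to the target
    Step(3, {"c"}, {"a"}),
    Step(4, set(), {"c"}),   # target: reads c
]
print(backward_slice(trace, 3))
```

In this toy trace, line 2 is dropped from the slice because nothing the target reads depends on `b`; this is the kind of pruning that keeps the prompt focused.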

Abstract

While recent advances in large language models (LLMs) have shown promise in automating test generation for regression testing, they often suffer from limited reasoning about program execution, resulting in stagnating coverage growth, a phenomenon known as the coverage plateau. This paper presents TestWeaver, a novel LLM-based approach that integrates lightweight program analysis to create a focused execution context that helps LLMs generate better tests. TestWeaver combines the following components to overcome LLMs' limited reasoning about complex execution: (1) it reduces hallucinations and improves focus by supplying the LLM with the backward slice from the target line instead of the full program context; (2) it identifies and incorporates close test cases (those that share control-flow similarities with the path to the target line) to provide focused execution context within the LLM's context window; and (3) it enhances the LLM's reasoning with in-line execution annotations that encode variable states as comments along the executed path. By equipping LLMs with these targeted and contextualized inputs, it improves coverage-guided test generation and mitigates redundant exploration. Empirical results show that TestWeaver accelerates code coverage growth and generates more effective test cases than state-of-the-art approaches.
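Components (2) and (3) can be sketched as follows. This is a hedged illustration only: the abstract does not specify the similarity metric, so Jaccard overlap between a test's executed lines and the slice's lines is an assumption here, and `closest_test` and `annotate` are hypothetical helper names.

```python
from typing import Dict, List, Set


def closest_test(slice_lines: Set[int], traces: Dict[str, Set[int]]) -> str:
    """Pick the test whose executed lines overlap most with the slice.

    Assumption: similarity is Jaccard overlap of line sets; the paper's
    actual control-flow similarity measure may differ.
    """
    def jaccard(executed: Set[int]) -> float:
        union = slice_lines | executed
        return len(slice_lines & executed) / len(union) if union else 0.0

    # Iterating over sorted names makes tie-breaking deterministic.
    return max(sorted(traces), key=lambda name: jaccard(traces[name]))


def annotate(source: List[str], states: Dict[int, Dict[str, object]]) -> List[str]:
    """Append recorded variable values as comments to executed lines.

    `states` maps a 1-indexed line number to the variable values observed
    there while running the closest test.
    """
    out = []
    for lineno, text in enumerate(source, start=1):
        if lineno in states:
            rendered = ", ".join(
                f"{name}={value!r}" for name, value in sorted(states[lineno].items())
            )
            out.append(f"{text}  # {rendered}")
        else:
            out.append(text)
    return out


traces = {"test_a": {1, 2, 3}, "test_b": {1, 2, 4}}
best = closest_test({1, 2, 4, 5}, traces)
annotated = annotate(
    ["x = f(a)", "if x > 3:", "    y = 1"],
    {1: {"x": 2, "a": 1}},
)
print(best)
print("\n".join(annotated))
```

The annotated line `x = f(a)  # a=1, x=2` shows the LLM why the branch `x > 3` was not taken by the closest test, which is the kind of runtime hint the abstract describes.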

Paper Structure

This paper contains 28 sections, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: LLMs struggle with execution reasoning: an example of GPT-4o generating a test case to cover line 27.
  • Figure 2: Improvement in test case generation using dynamic backward slicing
  • Figure 3: Overview of the TestWeaver pipeline. The process begins by generating initial test seeds and identifying uncovered lines through execution. For each uncovered line, a backward slice is computed to produce sliced code, which is then provided to the LLM for test generation. If the generated test fails to cover the target, the re-generation stage is triggered: the closest failing test is retrieved from the current suite, its execution trace is used to annotate the sliced code with variable values (execution in-lines), and the enriched prompt is re-submitted to the LLM. Any successful test is added to the suite, and the process continues until all uncovered lines have been addressed or a time limit is reached.
  • Figure 4: Performance comparison on code coverage (RQ1)
  • Figure 5: Coverage improvements across three phases of TestWeaver on the CM suite (RQ2).
  • ...and 2 more figures