Test vs Mutant: Adversarial LLM Agents for Robust Unit Test Generation

Pengyu Chang, Yixiong Fang, Silin Chen, Yuling Shi, Beijun Shen, Xiaodong Gu

TL;DR

AdverTest addresses robustness gaps in automated unit test generation by jointly optimizing test quality and fault-detection capability. It sets up an adversarial loop between a Test Case Generation Agent and a Mutant Generation Agent, using coverage $C$ and mutation score $S$ as bidirectional feedback so that test suites and mutants co-evolve. It incorporates mutation testing into the generation process and reports improved fault detection rates (up to 8.56% over HITS and 63.30% over EvoSuite) with competitive line and branch coverage on Defects4J and GrowingBugs. The work provides replication resources and highlights the practical value of adversarial, mutation-guided test generation for real-world software robustness.

Abstract

Software testing is a critical yet resource-intensive phase of the software development lifecycle. Over the years, various automated tools have been developed to aid in this process. Search-based approaches typically achieve high coverage but produce tests with low readability, whereas large language model (LLM)-based methods generate more human-readable tests but often suffer from low coverage and compilability. While the majority of research efforts have focused on improving test coverage and readability, little attention has been paid to enhancing the robustness of bug detection, particularly in exposing corner cases and vulnerable execution paths. To address this gap, we propose AdverTest, a novel adversarial framework for LLM-powered test case generation. AdverTest comprises two interacting agents: a test case generation agent (T) and a mutant generation agent (M). These agents engage in an adversarial loop in which M persistently creates new mutants that "hack" the blind spots of T's current test suite, while T iteratively refines its test cases to "kill" the challenging mutants produced by M. This interaction loop is guided by both coverage and mutation scores, enabling the system to co-evolve toward both high test coverage and strong bug detection capability. Experimental results on the Defects4J dataset show that our approach improves fault detection rates by 8.56% over the best existing LLM-based methods and by 63.30% over EvoSuite, while also improving line and branch coverage.

Paper Structure

This paper contains 31 sections, 3 equations, 3 figures, 2 tables, 2 algorithms.

Figures (3)

  • Figure 1: Overview of AdverTest. Agents T and M alternately generate tests and create mutants, guided by coverage and mutation-score feedback.
  • Figure 2: Mutation Score (MS), Line Coverage (CV), and Fault Detection Rate across Nine Rounds. The shaded band indicates the standard deviation.
  • Figure 3: An Example of the Fault Detection Process for arrangeFF.