Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

Zhi Chen, Zhensu Sun, Yuling Shi, Chao Peng, Xiaodong Gu, David Lo, Lingxiao Jiang

TL;DR

This paper examines whether agent-generated tests in autonomous LLM-based SWE agents meaningfully aid GitHub issue resolution. Using SWE-bench Verified trajectories with a Bash-only mini-SWE-agent across six LLM families, it analyzes emergent testing behaviors, the content of agent-written tests, and the causal impact of prompting interventions that encourage or discourage test writing. It finds that test writing is highly model-dependent, that tests mostly provide observational feedback via value-revealing prints rather than robust verification, and that prompt-induced shifts in test-writing volume yield limited changes in final outcomes but substantial changes in resource usage. The work contributes a behavioral analysis of agent-written tests, an AST-based four-category assertion classifier, and a causal evaluation framework for testing prompts, highlighting the need for cost-aware, smarter testing strategies in high-autonomy software agents.

Abstract

Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, a paradigm adopted by many high-ranking agents on the SWE-bench leaderboard. However, we observe that GPT-5.2, which writes almost no new tests, can still achieve performance comparable to top-ranking agents. This raises a critical question: do such tests meaningfully improve issue resolution, or do they merely mimic human testing practices while consuming a substantial interaction budget? To reveal the impact of agent-written tests, we present an empirical study that analyzes agent trajectories across six state-of-the-art LLMs on SWE-bench Verified. Our results show that while test writing is commonly adopted, resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. Furthermore, these tests typically serve as observational feedback channels, where agents prefer value-revealing print statements significantly more than formal assertion-based checks. Based on these insights, we perform a controlled experiment by revising the prompts of four agents to either increase or reduce test writing. The results suggest that changes in the volume of agent-written tests do not significantly change final outcomes. Taken together, our study reveals that current test-writing practices may provide marginal utility in autonomous software engineering tasks.

Paper Structure

This paper contains 51 sections, 2 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Overview of our study design. RQ1 profiles emergent testing behaviors (test-writing frequency, timing, and execution analysis). RQ2 characterizes the feedback signals encoded in agent-written tests (assertions vs. value-revealing prints) and the types of assertions. RQ3 applies prompt interventions to encourage or discourage writing tests, and measures both outcome impact and efficiency impact.
  • Figure 2: Composition of feedback signals in agent-written tests across models. Value-revealing prints dominate over assertions for all models.
  • Figure 3: Outcome-transition distribution on tasks with an intended test-status change
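
The distinction between assertions and value-revealing prints that RQ2 and Figure 2 refer to can be illustrated with a small sketch. This is not the paper's classifier: the function name `count_feedback_signals`, the rule that a print counts as "value-revealing" when at least one argument is not a literal constant, and the sample script are all assumptions made for illustration. The paper's four assertion categories are not reproduced here.

```python
import ast


def count_feedback_signals(source: str) -> dict:
    """Count assertion-style checks and value-revealing prints in a script.

    Assumption (not from the paper): a print is "value-revealing" when at
    least one positional argument is something other than a literal constant.
    """
    counts = {"assertions": 0, "value_prints": 0}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assert):
            # Plain `assert expr` statements.
            counts["assertions"] += 1
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute) and func.attr.startswith("assert"):
                # unittest-style calls such as self.assertEqual(...).
                counts["assertions"] += 1
            elif isinstance(func, ast.Name) and func.id == "print":
                if any(not isinstance(arg, ast.Constant) for arg in node.args):
                    counts["value_prints"] += 1
    return counts


# Hypothetical agent-written test script used only as sample input.
example = '''
result = parse("1,2")
print("parsed:", result)
print(len(result))
print("done")
assert result == [1, 2]
'''

print(count_feedback_signals(example))
```

On this sample the two prints that expose runtime values are counted, the constant-only `print("done")` is not, and the single `assert` is counted once. A real analysis would also need to handle test frameworks and shell-level output, which this sketch ignores.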