TDFlow: Agentic Workflows for Test Driven Software Engineering

Kevin Han, Siddharth Maddikayala, Tim Knappe, Om Patel, Austen Liao, Amir Barati Farimani

TL;DR

TDFlow is a modular, test-driven agentic workflow that splits repository-scale repair across four specialized sub-agents whose job is to resolve human-written reproduction tests. It frames software engineering as a test-resolution problem: patches are iteratively proposed, debugged, and revised, with optional test generation, all within a tightly constrained toolset. With human-written tests, TDFlow reaches an 88.8% pass rate on SWE-Bench Lite (27.8% absolute above the next best system) and 94.3% on SWE-Bench Verified, while test generation remains the main bottleneck. The work demonstrates the value of narrowly engineered, multi-agent interactions for software repair and envisions a human-LLM collaborative loop in which humans write tests and LLMs solve them, moving toward autonomous repository repair as test generation quality improves.

Abstract

We introduce TDFlow, a novel test-driven agentic workflow that frames repository-scale software engineering as a test-resolution task, specifically designed to solve human-written tests. Given a set of tests, TDFlow repeatedly proposes, revises, and debugs repository-scale patches using precisely engineered sub-agents and tightly constrained tools. The workflow decomposes program repair into four components, each governed by its own sub-agent. This simple, forced decoupling of patch proposing, debugging, patch revision, and optional test generation (1) reduces the long-context burden on any individual sub-agent, (2) focuses each sub-agent on specific, pre-defined sub-tasks, and (3) allows for specialized performance improvement on specific sub-tasks. When provided human-written tests, TDFlow attains an 88.8% pass rate on SWE-Bench Lite (an absolute improvement of 27.8% over the next best system) and 94.3% on SWE-Bench Verified. Manual inspection of the 800 TDFlow runs within SWE-Bench Lite and Verified uncovers only 7 instances of test hacking, which were subsequently counted as failures. Furthermore, we show that the primary obstacle to human-level software engineering performance lies in writing successful reproduction tests. We envision a human-LLM interactive system powered by TDFlow where human developers write tests solved by LLM systems. Together, these results indicate that modern LLMs, when embedded in a narrowly engineered, test-driven workflow, already achieve human-level test resolution, with the final frontier for fully autonomous repository repair being the accurate generation of valid reproduction tests.
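The propose/debug loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: `tdflow_loop`, `LoopOutcome`, the callback signatures, and the `max_iterations` cap are all hypothetical names chosen for illustration, the sub-agents are stood in for by plain callables, and the separate patch-revision and test-generation components are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical representation: test name -> whether it passed.
TestResults = Dict[str, bool]


@dataclass
class LoopOutcome:
    patch: Optional[str]  # last proposed patch, or None if none was proposed
    iterations: int       # number of propose/test rounds performed
    solved: bool          # True if every test passed on the final patch


def tdflow_loop(
    run_tests: Callable[[Optional[str]], TestResults],
    propose_patch: Callable[[TestResults, List[str]], str],
    debug_one: Callable[[str, str], str],
    max_iterations: int = 5,  # assumed budget, not taken from the paper
) -> LoopOutcome:
    """Illustrative propose -> test -> debug loop (assumed structure)."""
    reports: List[str] = []
    results = run_tests(None)  # baseline run on the unpatched repository
    patch: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        # A patch-proposing sub-agent sees test results and debug reports.
        patch = propose_patch(results, reports)
        results = run_tests(patch)
        failing = [name for name, passed in results.items() if not passed]
        if not failing:
            return LoopOutcome(patch, iteration, True)
        # Each failing test is debugged individually, yielding one report.
        reports = [debug_one(patch, name) for name in failing]
    return LoopOutcome(patch, max_iterations, False)
```

Passing the sub-agents in as callables keeps each role isolated, which mirrors the decoupling the abstract argues for: each callable only receives the inputs relevant to its sub-task.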

Paper Structure

This paper contains 34 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: The workflow behind Test-Driven Flow (TDFlow). TDFlow is a purely test-driven, agentic workflow for resolving repository-scale issues. TDFlow starts from either human-written reproduction tests or, optionally, reproduction tests that TDFlow generates itself. The tests are then run and provided to the Explore Files LLM sub-agent, whose sole task is to explore the repository in order to propose a patch. The tests are run on the proposed patch, and the Debug One sub-agent then debugs each failing test individually with a dedicated debugger tool and generates reports. The Explore Files sub-agent uses those reports to propose another patch.
  • Figure 2: The solid line depicts the solve (success) rate at each Bad Test Rate (BTR) level for the LLM-generated mode on SWE-Bench Verified using GPT-5. Bad Test Rate is the number of unsuccessful reproduction tests divided by the total number of LLM-generated tests. The dashed line depicts the percentage of instances at each bad test rate. When BTR is 0, TDFlow has a 93.3% solve rate.
  • Figure 3: (a) The overall success rate of both TDFlow modes: when TDFlow is provided with human-written tests and when TDFlow is provided with LLM-generated tests. (b) The overall success rate of both modes as a function of the maximum cost per instance.
  • Figure 4: The distribution of f2p, p2p, p2f, and f2f tests. f2p refers to tests which fail before the gold patch is applied and pass after. f2f refers to tests which fail both before and after the gold patch is applied. p2f refers to tests which pass before the gold patch is applied and fail after. p2p refers to tests which pass both before and after the gold patch is applied.
  • Figure 5: (a) A histogram of the distribution of test counts for the SWE-Bench Verified experiment where the LLM generates its own tests. The color bar refers to the solve rate of each bin. (b) The same distribution in histogram form except the color refers to the average BTR of each bin.
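
The test categories in Figure 4 and the Bad Test Rate in Figure 2 can be made concrete with a small sketch. The function names `classify_test` and `bad_test_rate` are hypothetical, and the sketch assumes that an "unsuccessful" reproduction test is any generated test that is not f2p; the captions do not spell out that definition.

```python
from typing import List


def classify_test(passes_before: bool, passes_after: bool) -> str:
    """Label a test by its outcome before and after the gold patch.

    Returns one of "f2p", "f2f", "p2f", "p2p".
    """
    before = "p" if passes_before else "f"
    after = "p" if passes_after else "f"
    return f"{before}2{after}"


def bad_test_rate(categories: List[str]) -> float:
    """Unsuccessful tests divided by all generated tests.

    Assumption (not stated in the captions): a test counts as
    unsuccessful whenever it is not f2p.
    """
    if not categories:
        raise ValueError("BTR is undefined when no tests were generated")
    bad = sum(1 for category in categories if category != "f2p")
    return bad / len(categories)


# Example: four generated tests, two of which reproduce the issue.
labels = [
    classify_test(False, True),   # f2p
    classify_test(False, False),  # f2f
    classify_test(True, True),    # p2p
    classify_test(False, True),   # f2p
]
print(labels, bad_test_rate(labels))
```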