You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects

Islem Bouzenia; Michael Pradel

TL;DR

This work tackles the problem of automatically building and running the test suites of arbitrary software projects, which is difficult because projects differ in programming languages and tooling and are often incompletely documented. It introduces ExecutionAgent, an autonomous LLM-based agent that uses meta-prompting to gather up-to-date guidelines and follows a two-phase process (Preparation and Feedback Loop), guided by a control center, to generate scripts for setting up the environment and for building and testing the project. Evaluated on 50 open-source projects in 14 languages, the approach successfully executes the test suites of 33 projects, outperforms language-specific baselines and general-purpose agents, closely matches ground-truth test outcomes, and incurs moderate per-project costs (about 74 minutes and USD 0.16 on average). The results suggest that ExecutionAgent can become a valuable tool for developers, automated programming systems, and researchers who need scalable, cross-language test execution, while offering insights into design decisions, tool usage, and failure modes of autonomous software engineering agents.

Abstract

The ability to execute the test suite of a project is essential in many scenarios, e.g., to assess code quality and code coverage, to validate code changes made by developers or automated tools, and to ensure compatibility with dependencies. Despite its importance, executing the test suite of a project can be challenging in practice because different projects use different programming languages, software ecosystems, build systems, testing frameworks, and other tools. These challenges make it difficult to create a reliable, universal test execution method that works across different projects. This paper presents ExecutionAgent, an automated technique that prepares scripts for building an arbitrary project from source code and running its test cases. Inspired by the way a human developer would address this task, our approach is a large language model (LLM)-based agent that autonomously executes commands and interacts with the host system. The agent uses meta-prompting to gather guidelines on the latest technologies related to the given project, and it iteratively refines its process based on feedback from the previous steps. Our evaluation applies ExecutionAgent to 50 open-source projects that use 14 different programming languages and many different build and testing tools. The approach successfully executes the test suites of 33/50 projects, while matching the test results of ground truth test suite executions with a deviation of only 7.5%. These results improve over the best previously available technique by 6.6x. The costs imposed by the approach are reasonable, with an execution time of 74 minutes and LLM costs of USD 0.16, on average per project. We envision ExecutionAgent to serve as a valuable tool for developers, automated programming tools, and researchers that need to execute tests across a wide variety of projects.
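
The abstract describes an agent that executes commands on the host system and refines its next step based on feedback from the previous ones. The following Python sketch illustrates that general idea only; it is not the authors' implementation. All names (`Step`, `Policy`, `run_command`, `feedback_loop`, `scripted_policy`) are hypothetical, and a fixed, scripted policy stands in for the LLM so that the example runs without any model access.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One executed command together with the feedback it produced."""
    command: str
    exit_code: int
    output: str


# Hypothetical policy type: given the history so far, return the next shell
# command, or None to stop. In a real agent, an LLM would play this role.
Policy = Callable[[List[Step]], Optional[str]]


def run_command(command: str, timeout: int = 60) -> Step:
    """Run one shell command and capture its exit code and combined output."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return Step(command, proc.returncode, proc.stdout + proc.stderr)


def feedback_loop(policy: Policy, max_steps: int = 10) -> List[Step]:
    """Ask the policy for a command, run it, record the result, and repeat."""
    history: List[Step] = []
    for _ in range(max_steps):
        command = policy(history)
        if command is None:
            break
        history.append(run_command(command))
    return history


def scripted_policy(history: List[Step]) -> Optional[str]:
    """Stand-in for the LLM (assumption): react to the last exit code."""
    if not history:
        return "exit 3"  # simulated failing build step
    if history[-1].exit_code != 0:
        return "echo tests passed"  # simulated corrected step
    return None  # the last command succeeded, so stop


if __name__ == "__main__":
    for step in feedback_loop(scripted_policy):
        print(step.exit_code, step.command, step.output.strip())
```

The step budget (`max_steps`) and per-command timeout are illustrative safeguards against runaway loops and hanging commands; the values chosen here are arbitrary.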
