Repo2Run: Automated Building Executable Environment for Code Repository at Scale

Ruida Hu, Chao Peng, Xinchen Wang, Junjielong Xu, Cuiyun Gao

TL;DR

Repo2Run introduces the first LLM-based agent specifically designed to automate the construction of executable testing environments for code repositories at scale. It employs a dual-environment architecture with an internal build container and an external helper, plus a rollback-enabled workflow and a Dockerfile synthesizer to replay successful builds as runnable Dockerfiles. Evaluated on a benchmark of 420 Python repositories, it achieves 86.0% environment-building success and 100% Dockerfile viability, outperforming baselines by a wide margin. The approach reduces manual effort in environment provisioning, enabling scalable creation of reproducible code execution environments and facilitating large-scale software engineering data collection for modeling and research.

Abstract

Scaling up executable code data is important for improving language models' software engineering capability. Building a large number of executable code repositories is an intricate process that is labor-intensive, time-consuming, and dependent on expert knowledge, which limits the scalability of existing work based on running tests. The primary bottleneck lies in the automated building of test environments for different repositories, an essential yet underexplored task. To bridge this gap, we introduce Repo2Run, the first LLM-based agent that aims to automate the building of executable test environments for any repository at scale. Specifically, given a code repository, Repo2Run iteratively builds the Docker image, runs unit tests based on the feedback from the build, and synthesizes the Dockerfile until the entire pipeline executes successfully. The resulting Dockerfile can then be used to create Docker container environments for running code and tests. We created a benchmark containing 420 Python repositories with unit tests for evaluation. The results show that Repo2Run achieves an 86.0% success rate, outperforming SWE-agent by 77.0%. The resources of Repo2Run are available at https://github.com/bytedance/Repo2Run.
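
The rollback-enabled workflow mentioned in the TL;DR can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not Repo2Run's implementation: the environment is a plain dict rather than a Docker container, and `run_with_rollback`, `install_ok`, and `install_broken` are hypothetical names introduced only for illustration.

```python
import copy


def run_with_rollback(state, commands):
    """Apply commands in order and undo any command that fails.

    `state` is a plain dict standing in for a container's packages and files
    (an assumption for illustration; a real system would snapshot a container).
    Each command is a (name, function) pair; the function mutates the state
    in place and raises an exception on failure.
    Returns the final state and the names of the commands that succeeded.
    """
    recorded = []
    for name, fn in commands:
        snapshot = copy.deepcopy(state)  # stand-in for an environment snapshot
        try:
            fn(state)
        except Exception:
            state = snapshot  # discard partial changes left by the failed command
            continue
        recorded.append(name)  # only successful commands are kept for replay
    return state, recorded


def install_ok(state):
    # Hypothetical command that succeeds.
    state["packages"].append("requests")


def install_broken(state):
    # Hypothetical command that changes the environment and then fails.
    state["packages"].append("half-installed")
    raise RuntimeError("build failed")


final_state, recorded = run_with_rollback(
    {"packages": []},
    [("pip install requests", install_ok), ("commandA", install_broken)],
)
print(final_state)
print(recorded)
```

Because the failed command is rolled back rather than recorded, the list of recorded commands can be replayed without reproducing the failure.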

Paper Structure

This paper contains 48 sections, 2 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: The pipeline of code repository mining and manual environment building. Developers manually write Dockerfiles through iterative steps: base environment selection, dependency installation, test running, error handling, and validation of the environment by running unit tests.
  • Figure 2: Four error types (highlighted in different colors) that SWE-agent fails to resolve during environment building.
  • Figure 3: Illustration of a failed command "polluting" the environment. (a) Failed commands like "commandA" can irreversibly "pollute" the environment by altering packages, files, or directories, making subsequent builds unstable. (b) To reproduce those changes, the failed command "RUN commandA" would have to be added to the Dockerfile, but adding it causes the build to fail.
  • Figure 4: The workflow of Repo2Run, involving two phases: the build phase and the record phase. The build phase utilizes a dual-environment architecture: the internal environment offers five actions for environment building, while the external environment provides three actions that assist the internal environment. The record phase converts the validated command sequence into a runnable Dockerfile for reconstructing the executable environment. See the appendix for more examples of these actions.
  • Figure 5: Rules for Dockerfile synthesis, illustrating how the Dockerfile synthesizer maps executed commands into Dockerfile statements using four keywords: "FROM", "ENV", "COPY", and "RUN". Black arrows represent the creation of statements, while green arrows indicate their transformations. Red text next to the arrows specifies the commands executed during each step.
  • ...and 7 more figures
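
The Figure 5 caption says the Dockerfile synthesizer maps executed commands to statements using the keywords "FROM", "ENV", "COPY", and "RUN". The sketch below shows one way such a mapping could look. The specific rules are assumptions, since the caption does not spell them out: the base image becomes FROM, the repository directory is copied with COPY, `export KEY=VALUE` commands become ENV, and every other command becomes RUN. The function name `synthesize_dockerfile` and its parameters are hypothetical.

```python
def synthesize_dockerfile(base_image, commands, repo_dir="repo"):
    """Turn a base image and a list of successful shell commands into Dockerfile text.

    Assumed rules (for illustration only, not taken from the paper):
      - the base image becomes a FROM statement
      - the repository directory is copied into the image with COPY
      - `export KEY=VALUE` becomes an ENV statement
      - any other command becomes a RUN statement
    """
    lines = [f"FROM {base_image}", f"COPY {repo_dir} /{repo_dir}"]
    for cmd in commands:
        cmd = cmd.strip()
        if cmd.startswith("export ") and "=" in cmd:
            key, value = cmd[len("export "):].split("=", 1)
            lines.append(f"ENV {key.strip()}={value.strip()}")
        else:
            lines.append(f"RUN {cmd}")
    return "\n".join(lines) + "\n"


dockerfile = synthesize_dockerfile(
    "python:3.10",
    ["export PYTHONPATH=/repo", "pip install pytest"],
)
print(dockerfile)
```

Feeding the synthesizer only commands that succeeded keeps failed commands, such as the "RUN commandA" case in Figure 3, out of the resulting Dockerfile.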