EnvBench: A Benchmark for Automated Environment Setup

Aleksandra Eliseeva, Alexander Kovrigin, Ilia Kholkin, Egor Bogomolov, Yaroslav Zharov

TL;DR

EnvBench addresses the challenge of automated environment setup for diverse software repositories by offering a large-scale benchmark across the Python and JVM ecosystems, paired with two language-specific evaluation metrics: a static analysis check for Python and a compilation check for JVM languages. It evaluates three LLM-driven baselines (Zero-shot, Installamatic, Bash Agent) with GPT-4o and GPT-4o-mini backbones, revealing that even the best-performing approach correctly configures only a small share of repositories (6.69% for Python, 29.47% for JVM). The study highlights both the potential of LLMs to reduce configuration errors and the frequent generation of faulty scripts when error feedback is absent, emphasizing the need for robust feedback and verification mechanisms. The public EnvBench suite enables scalable benchmarking and future extension to more languages and runtime-based checks, advancing research on automated repository setup.

Abstract

Recent advances in Large Language Models (LLMs) have enabled researchers to focus on practical repository-level tasks in the software engineering domain. In this work, we consider a cornerstone task for automating work with software repositories: environment setup, i.e., the task of configuring a repository-specific development environment on a system. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets that may not capture the full range of configuration challenges encountered in practice. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark. It encompasses 329 Python and 665 JVM-based (Java, Kotlin) repositories, with a focus on repositories that present genuine configuration challenges, excluding projects that can be fully configured by simple deterministic scripts. To enable further benchmark extension and usage for model tuning, we implement two automatic metrics: a static analysis check for missing imports in Python and a compilation check for JVM languages. We demonstrate the applicability of our benchmark by evaluating three environment setup approaches, a simple zero-shot baseline and two agentic workflows, which we test with two powerful LLM backbones, GPT-4o and GPT-4o-mini. The best approach successfully configures 6.69% of repositories for Python and 29.47% of repositories for JVM, suggesting that EnvBench remains challenging for current approaches. Our benchmark suite is publicly available at https://github.com/JetBrains-Research/EnvBench. The dataset and experiment trajectories are available at https://jb.gg/envbench.
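
To make the idea of a static check for missing imports concrete, here is a minimal sketch in Python. It is not the EnvBench implementation, whose tooling is not described in this summary. It assumes that counting unresolvable top-level modules is a reasonable stand-in for the check, and the function name `missing_imports` is hypothetical.

```python
import ast
import importlib.util


def missing_imports(source: str) -> list[str]:
    """Return top-level modules imported in `source` that cannot be resolved.

    Illustrative only: a real static analyzer also resolves submodules,
    conditional imports, and the repository's own packages.
    """
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b.c" -> check that top-level package "a" resolves
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Skip relative imports (level > 0); they refer to the repo itself
            names.add(node.module.split(".")[0])
    # find_spec returns None when a top-level module is not installed
    return sorted(n for n in names if importlib.util.find_spec(n) is None)
```

Run inside the environment produced by a setup script, a check like this returns an empty list when every imported package is resolvable, and a list of unresolved names otherwise.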

Paper Structure

This paper contains 23 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Overview of the workflow with EnvBench. The process begins with cloning a target repository. Next, the repository is passed as an input to an environment setup approach, which then produces a shell script to set up the repository as an output. Internally, it could be, for instance, a single LLM request or an AI agent building a script dynamically. Finally, in our evaluation suite, we execute the produced script and verify the environment is correctly configured through static analysis and compilation checks.
  • Figure 2: Baselines comparison for the Tablib repository. GPT-4o-mini is used for all baselines. The commented lines have been removed for brevity.
  • Figure 3: Most frequent Bash commands executed by Bash Agent with GPT-4o on the Python dataset.
  • Figure 4: Most frequent Bash commands executed by Bash Agent with GPT-4o on the JVM dataset.
  • Figure 5: Histograms for avgErrs (the average number of missing errors per repository) for expert-produced scripts and for scripts from Bash Agent with GPT-4o-mini obtained via bootstrap resampling (10,000 iterations).
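
The bootstrap resampling mentioned in the Figure 5 caption can be sketched as follows. This is a minimal illustration, assuming each repository contributes one error count and that repositories are resampled with replacement. The function name `bootstrap_means` and the sample counts are hypothetical and are not taken from the paper.

```python
import random


def bootstrap_means(errors_per_repo, iterations=10_000, seed=0):
    """Return `iterations` bootstrap estimates of the mean error count.

    Each estimate is the mean of a resample (with replacement) of the
    same size as the original list of per-repository error counts.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(errors_per_repo)
    return [sum(rng.choices(errors_per_repo, k=n)) / n for _ in range(iterations)]


# Hypothetical per-repository error counts, for illustration only
counts = [0, 2, 5, 1, 0, 3]
means = bootstrap_means(counts)
```

A histogram of `means` then shows the spread of the average error count under resampling, which is the kind of distribution the caption describes.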