Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents

Jiayi Kuang, Yinghui Li, Xin Zhang, Yangning Li, Di Yin, Xing Sun, Ying Shen, Philip S. Yu

TL;DR

Environment configuration is a bottleneck for SWE agents, and end-to-end benchmarks fail to reveal intermediate capabilities. The authors propose EnConda-Bench, a process-level trajectory benchmark with an automated data-construction pipeline that injects realistic errors into READMEs and validates them in Docker, enabling fine-grained analysis of planning, perception, feedback, and action. Across multiple LLMs and agent frameworks, agents reliably localize errors but struggle to translate feedback into robust corrective actions, limiting end-to-end success. The framework yields actionable insights for improving environment configuration capabilities and offers a scalable path toward richer, trajectory-based training data for software engineering agents.

Abstract

Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, EnConda-Bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and action to execute the final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. EnConda-Bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, EnConda-Bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
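To make the contrast between process-level and end-to-end assessment concrete, the following is a minimal sketch, not the benchmark's actual scoring code. It assumes each injected README error can be represented as a (step, error type) pair and that error localization is scored by set overlap with the golden labels; the `ErrorLabel` schema, the function names, and the choice of precision/recall/F1 are all illustrative assumptions rather than details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorLabel:
    """One injected README error (hypothetical schema, not from the paper)."""
    step: int        # index of the README step that contains the error
    error_type: str  # illustrative category name, e.g. "dependency_version"


def localization_scores(golden, predicted):
    """Precision, recall, and F1 of predicted error labels against golden ones."""
    golden_set, predicted_set = set(golden), set(predicted)
    hits = len(golden_set & predicted_set)
    precision = hits / len(predicted_set) if predicted_set else 0.0
    recall = hits / len(golden_set) if golden_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def evaluate_instance(golden, predicted, build_succeeded):
    """Report the process-level scores next to the end-to-end outcome."""
    precision, recall, f1 = localization_scores(golden, predicted)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "end_to_end_success": build_succeeded,
    }
```

Under these assumptions, an agent that finds one of two injected errors and then fails the final build receives partial process-level credit, whereas a purely end-to-end metric would record only the failure.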

Paper Structure

This paper contains 37 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: An illustration of common problems in environment configuration. When human engineers configure an environment, they often encounter various errors. They must first identify the step where the error occurred and then fix the problem before proceeding to the next step, until the configuration is complete. Similarly, intelligent agents performing environment configuration should possess good planning, perception, feedback, and action capabilities.
  • Figure 2: An example of the overall workflow of our process-level environment configuration task.
  • Figure 3: An illustration of our overall benchmark construction pipeline.
  • Figure 4: Data statistics results.
  • Figure 5: The statistics of the error numbers of the golden label and the model's prediction.
  • ...and 4 more figures