Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents

Jiayi Kuang; Yinghui Li; Xin Zhang; Yangning Li; Di Yin; Xing Sun; Ying Shen; Philip S. Yu

TL;DR

Environment configuration is a bottleneck for SWE agents, and end-to-end benchmarks fail to reveal intermediate capabilities. The authors propose EnConda-Bench, a process-level trajectory benchmark with an automated data-construction pipeline that injects realistic errors into READMEs and validates them in Docker, enabling fine-grained analysis of planning, perception, feedback, and action. Across multiple LLMs and agent frameworks, agents reliably localize errors but struggle to translate feedback into robust corrective actions, limiting end-to-end success. The framework yields actionable insights for improving environment configuration capabilities and offers a scalable path toward richer, trajectory-based training data for software engineering agents.

Abstract

Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and action to execute the final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
