A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks
Zhuowen Yin, Cuifeng Gao, Chunsong Fan, Wenzhang Yang, Yinxing Xue, Lijun Zhang
TL;DR
The paper presents a comprehensive empirical comparison of seven general-purpose agent frameworks across three code-centric software-engineering tasks (software development, vulnerability detection, and program repair). It introduces a unified evaluation along effectiveness, efficiency, and overhead using standardized benchmarks (SRDD, LLM-SmartAudit, SWE-bench Lite) and a consistent backend LLM, enabling fair cross-task comparison. Findings show distinct patterns: OpenHands excels in software-development quality, GPTswarm leads in vulnerability detection, and program repair remains challenging, with only moderate success. Multi-agent approaches produce longer trajectories but do not always achieve higher effectiveness, owing to coordination overhead and tool-access limits. The study provides practical guidance for framework selection and design optimization, highlighting the importance of task-specific tooling, trajectory summarization, and token-cost considerations for real-world adoption.
Abstract
Unlike traditional automation tools or static LLM-based systems, agents combine decision-making and tool utilization to accomplish complex tasks, showing great potential in software engineering. However, existing studies largely focus on specific tasks or isolated aspects, providing an incomplete picture of agents' practical capabilities. To address this, we conduct a comprehensive empirical study evaluating seven general-purpose agent frameworks across three representative code-centric tasks: software development, vulnerability detection, and program repair. Each task is assessed using standard, widely adopted benchmarks to ensure objective and comparable evaluation. Agent performance is systematically analyzed from three complementary perspectives: effectiveness (task success), efficiency (execution process), and overhead (token consumption). Our findings reveal distinct capability patterns and trade-offs among the evaluated frameworks. In terms of effectiveness, agents achieve moderate overall performance. Regarding efficiency, AgentOrchestra tends to exhibit the longest trajectories and the most correction attempts due to coordination overhead, whereas OpenHands demonstrates stronger reflective reasoning abilities. For overhead, software development incurs the highest monetary cost, while GPTswarm remains the most cost-efficient. Furthermore, we conduct an in-depth cross-analysis of the relationship between effectiveness and efficiency, exploring the underlying reasons behind their interplay. These findings guide both practical adoption and future research toward more efficient software engineering agents.
