SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
Jingxuan Xu, Ken Deng, Weihao Li, Songwei Yu, Huaixi Tang, Haoyang Huang, Zhiyi Lai, Zizheng Zhan, Yanan Wu, Chenchen Zhang, Kepeng Lei, Yifan Yao, Xinping Lei, Wenqiang Zhu, Zongxian Feng, Han Li, Junqi Xiong, Dailin Li, Zuchen Gao, Kun Wu, Wen Xiang, Ziqi Zhan, Yuanxing Zhang, Wuxuan Gong, Ziyuan Gao, Guanxiang Wang, Yirong Xue, Mengtong Li, Mengfei Xie, Xiaojiang Zhang, Jinghui Wang, Wenhao Zhuang, Zheng Lin, Huiming Wang, Zhaoxiang Zhang, Yuqun Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu
TL;DR
SWE-Compass tackles the limitations of existing software-engineering benchmarks by introducing a unified, execution-grounded benchmark that spans 8 task types, 8 programming scenarios, and 10 languages across 2000 verified instances. It employs a five-step construction pipeline (from user analysis to data validation) to ensure real-world alignment, comprehensive coverage, and faithful evaluation through reproducible environments and task-specific metrics. The study evaluates ten diverse LLMs under two agentic frameworks (SWE-Agent and Claude Code) across executable and non-executable tracks, revealing a consistent hierarchy of task difficulty, language effects, and complementary strengths between frameworks. The findings highlight the importance of requirement grounding, environment reliability, and scalable reasoning over raw code generation, offering a rigorous foundation for diagnosing and advancing agentic coding capabilities in practice.
Abstract
Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.
