SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang
TL;DR
To address the lack of multilingual issue-resolving benchmarks, this paper presents SWE-bench-java-verified, a Java-focused extension of SWE-bench with a public dataset, a Docker-based evaluation environment, and a leaderboard. It details a five-phase construction workflow, from repository selection to manual verification, and demonstrates SWE-agent-based evaluation using models such as GPT-4o and DeepSeek variants, reporting results across 91 issues in 6 repositories that fall short of perfect but still discriminate between models. The results show that more detailed issue descriptions improve problem-solving performance and that performance varies by repository, underscoring the benchmark's difficulty and its utility for guiding future research. The work lays the groundwork for extending the benchmark to additional languages and calls for community contributions to accelerate multilingual development and refinement.
Abstract
GitHub issue resolving is a critical task in software engineering and has recently gained significant attention in both industry and academia. For this task, SWE-bench was released to evaluate the issue-resolving capabilities of large language models (LLMs), but it has so far focused only on Python. Supporting more programming languages is also important, as there is strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it. Developing a high-quality multilingual benchmark is well known to be time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.
