A Benchmark for Localizing Code and Non-Code Issues in Software Projects
Zejun Zhang, Jian Wang, Qingyun Yang, Yifan Pan, Yi Tang, Yi Li, Zhenchang Xing, Tian Zhang, Xuandong Li, Guoan Zhang
TL;DR
This work presents MULocBench, a project-location benchmark for issue resolution that expands beyond code-only, PR-centric signals to include commits and resolution comments across 46 Python projects. It describes a three-stage benchmark construction process, a taxonomy of issue types and root causes, and location annotations at several granularities (project, file, class, function, line) across in-project, runtime, third-party, and user-authored scopes. Empirical evaluations show that state-of-the-art localization methods and five LLM prompting strategies struggle to achieve high accuracy, with $Acc@5$ for file-level localization typically below 40%, which points to a substantial gap between current performance and real-world requirements. The paper argues for more realistic benchmarks and publicly releases MULocBench to support future work on project localization for issue resolution.
Abstract
Accurate project localization (e.g., files and functions) for issue resolution is a critical first step in software maintenance. However, existing benchmarks for issue localization, such as SWE-Bench and LocBench, are limited. They focus predominantly on pull-request issues and code locations, ignoring other resolution evidence, such as commits and comments, and non-code files, such as configurations and documentation. To address this gap, we introduce MULocBench, a comprehensive dataset of 1,100 issues from 46 popular GitHub Python projects. Compared with existing benchmarks, MULocBench offers greater diversity in issue types, root causes, location scopes, and file types, providing a more realistic testbed for evaluation. Using this benchmark, we assess the performance of state-of-the-art localization methods and five LLM-based prompting strategies. Our results reveal significant limitations in current techniques: even at the file level, performance metrics (Acc@5, F1) remain below 40%. This underscores the challenge of generalizing to realistic, multi-faceted issue resolution. To enable future research on project localization for issue resolution, we publicly release MULocBench at https://huggingface.co/datasets/somethingone/MULocBench.
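To make the file-level metrics concrete, the following is a minimal sketch of how Acc@k and F1 could be computed for a localization method. It rests on an assumption not stated in the abstract: that Acc@k counts an issue as correctly localized only when every ground-truth file appears among the top-k predictions, a convention used in some prior localization work. The paper's exact definitions may differ, and the function names and file paths below are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def acc_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> bool:
    """Return True if every gold file is among the top-k ranked predictions.

    ASSUMPTION: the "all gold locations covered" convention; the benchmark's
    own definition of Acc@k may be different.
    """
    return bool(gold) and gold.issubset(ranked[:k])


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    """Set-based F1 between predicted and gold file locations."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_acc_at_k(examples: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Average Acc@k over (ranked predictions, gold files) pairs."""
    if not examples:
        return 0.0
    return sum(acc_at_k(ranked, gold, k) for ranked, gold in examples) / len(examples)


# Hypothetical predictions for two issues (paths are illustrative only).
examples = [
    (["src/app.py", "docs/setup.md", "config.yaml"], {"src/app.py", "config.yaml"}),
    (["src/utils.py"], {"README.md"}),
]
overall_acc = mean_acc_at_k(examples, k=5)
```

Under this convention, an issue whose fix touches both a source file and a configuration file is counted as localized only if both are retrieved, which is one reason non-code locations can lower reported accuracy.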
