A Benchmark for Localizing Code and Non-Code Issues in Software Projects

Zejun Zhang, Jian Wang, Qingyun Yang, Yifan Pan, Yi Tang, Yi Li, Zhenchang Xing, Tian Zhang, Xuandong Li, Guoan Zhang

TL;DR

This work presents MULocBench, a project-location benchmark for issue resolution that expands beyond code-only, PR-centric signals to include commits and resolution comments across 46 Python projects. It defines a three-stage benchmark construction, a comprehensive taxonomy of issue types and root causes, and a rich set of location signals (project, file, class, function, line) across in-project, runtime, third-party, and user-authored scopes. Empirical evaluations show that state-of-the-art localization methods and five LLM prompting strategies struggle to achieve high accuracy, with $Acc@5$ for file-level localization typically below 40%, highlighting substantial gaps between benchmark performance and real-world requirements. The paper argues for broader realism in benchmarks and provides public access to MULocBench to catalyze future advances in effective project localization for issue resolution.

Abstract

Accurate project localization (e.g., files and functions) for issue resolution is a critical first step in software maintenance. However, existing benchmarks for issue localization, such as SWE-Bench and LocBench, are limited. They focus predominantly on pull-request issues and code locations, ignoring other evidence such as commits and comments, and non-code files such as configurations and documentation. To address this gap, we introduce MULocBench, a comprehensive dataset of 1,100 issues from 46 popular GitHub Python projects. Compared with existing benchmarks, MULocBench offers greater diversity in issue types, root causes, location scopes, and file types, providing a more realistic testbed for evaluation. Using this benchmark, we assess the performance of state-of-the-art localization methods and five LLM-based prompting strategies. Our results reveal significant limitations in current techniques: even at the file level, performance metrics (Acc@5, F1) remain below 40%. This underscores the challenge of generalizing to realistic, multi-faceted issue resolution. To enable future research on project localization for issue resolution, we publicly release MULocBench at https://huggingface.co/datasets/somethingone/MULocBench.
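To make the file-level metrics named above concrete, the following is a minimal sketch of how Acc@k and F1 could be computed for ranked file predictions. It is not the paper's evaluation code: the function names `acc_at_k` and `file_f1`, the input layout, and the convention that an issue counts as a hit only when all of its gold files appear in the top k are assumptions made for illustration; the benchmark's own definition may differ.

```python
from typing import Sequence


def acc_at_k(predictions: Sequence[Sequence[str]],
             gold: Sequence[Sequence[str]],
             k: int = 5) -> float:
    """Fraction of issues whose gold files ALL appear in the top-k predictions.

    Assumption: "all gold files covered" is the hit criterion; some works
    instead count an issue as a hit when any gold file is covered.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not predictions:
        return 0.0
    hits = 0
    for ranked, truth in zip(predictions, gold):
        top_k = set(ranked[:k])
        # An issue with no gold files is never counted as a hit.
        if truth and set(truth) <= top_k:
            hits += 1
    return hits / len(predictions)


def file_f1(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Set-based F1 between predicted and gold file paths for one issue."""
    pred, true = set(predicted), set(truth)
    overlap = len(pred & true)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(true)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two issues, one of which touches a non-code file.
example_predictions = [["src/app.py", "src/util.py"], ["src/cli.py"]]
example_gold = [["src/app.py"], ["docs/install.md"]]
example_acc = acc_at_k(example_predictions, example_gold, k=5)
example_f1 = file_f1(example_predictions[0], example_gold[0])
```

In this hypothetical example the second issue is missed because its gold location is a documentation file that the ranked list never proposes, which is the kind of non-code case the benchmark is designed to surface.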

Paper Structure

This paper contains 23 sections, 16 figures, 2 tables.

Figures (16)

  • Figure 1: MULocBench Construction Overview
  • Figure 2: Four issues illustrating different issue types, reasons, location scopes and types.
  • Figure 3: Issue number comparison between MULocBench, SWE-Bench Lite and LocBench
  • Figure 4: Performance comparison of different issue types on MULocBench with Python files.
  • Figure 5: Performance comparison of different issue reasons on MULocBench with Python files.
  • ...and 11 more figures