PerfCurator: Curating a large-scale dataset of performance bug-related commits from public repositories

Md Abul Kalam Azad; Manoj Alexender; Matthew Alexender; Syed Salauddin Mohammad Tariq; Foyzul Hassan; Probir Roy

TL;DR

PerfCurator addresses the lack of large-scale, open datasets of performance bugs by combining a lightweight, distillation-based classifier with scalable repository mining. The authors develop PcBERT-HS and PcBERT-KD to identify performance bug-related commits, reaching accuracy comparable to 7-billion-parameter LLMs at far lower computational cost. They deploy PerfCurator on a 50-node CPU cluster to mine GitHub repositories across Python, C++, and Java, producing a dataset of over 408K performance bug-fix commits. The study shows that a larger dataset improves data-driven performance bug detection, including API-misuse detection, which benefits performance engineering and tooling. Overall, the work provides a scalable workflow and a dataset to advance research on performance bugs and their mitigation.

Abstract

Performance bugs challenge software development, degrading performance and wasting computational resources. Software developers invest substantial effort in addressing these issues. Curating these performance bugs can offer valuable insights to the software engineering research community and aid in developing new mitigation strategies. However, no large-scale open-source dataset of performance bugs is available. To bridge this gap, we propose PerfCurator, a repository miner that collects performance bug-related commits at scale. PerfCurator employs PcBERT-KD, a 125M-parameter BERT model trained to classify performance bug-related commits. Our evaluation shows that PcBERT-KD achieves accuracy comparable to 7-billion-parameter LLMs with significantly lower computational overhead, enabling cost-effective deployment on CPU clusters. Using PcBERT-KD as the core component, we deployed PerfCurator on a 50-node CPU cluster to mine GitHub repositories. This mining operation produced a large-scale dataset comprising 114K performance bug-fix commits in Python, 217.9K in C++, and 76.6K in Java. Our results demonstrate that this large-scale dataset significantly enhances the effectiveness of data-driven performance bug detection systems.
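To make the classification task concrete, the sketch below shows a keyword filter over commit messages, the kind of approach the paper's outline names as its baseline. It is a minimal illustration only: the keyword list, `PERF_PATTERN`, `is_perf_commit`, `filter_perf_commits`, and the sample messages are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical keyword list, chosen for illustration; the paper's actual
# patterns are not reproduced here.
PERF_KEYWORDS = ["performance", "speed up", "speedup", "slow",
                 "latency", "optimiz", "faster"]

# One case-insensitive pattern that matches any keyword as a substring.
PERF_PATTERN = re.compile(
    "|".join(re.escape(k) for k in PERF_KEYWORDS), re.IGNORECASE
)


def is_perf_commit(message: str) -> bool:
    """Return True if the commit message contains any performance keyword."""
    return PERF_PATTERN.search(message) is not None


def filter_perf_commits(messages):
    """Keep only the messages flagged by the keyword filter, in order."""
    return [m for m in messages if is_perf_commit(m)]


# Hypothetical commit messages.
sample = [
    "Fix typo in README",
    "Speed up JSON parsing by caching compiled schema",
    "Optimize inner loop to avoid repeated allocation",
    "Add slow-test marker to CI config",
]

flagged = filter_perf_commits(sample)
```

In this sketch the last sample message is flagged because it contains "slow", even though it does not describe a performance fix. False positives of this kind are one plausible reason to prefer a learned classifier over plain keyword matching.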

Paper Structure (54 sections, 8 equations, 7 figures, 9 tables)

Introduction
Paper Contributions
Related Work
Software performance bugs
Performance monitoring and analysis tools
Methodology
Ground Truth Construction
Keyword-Filtering: The Baseline
RQ1: Large Language Models for Classification
Motivation
Mistral-7B
Prompt Template
Observation
Implication
RQ2: Training Small Language Models
...and 39 more sections

Figures (7)

Figure 1: Throughput vs. accuracy for various approaches to commit message classification
Figure 2: Accuracy of API misuse detection improved with increased performance commit data points collected by PerfCurator.
Figure 3: Workflow of PerfCurator pipeline
Figure 4: Prompt template. In our zero-shot experimental setting, we provide label descriptions (with temperature = 0).
Figure 5: Heuristic supervision-based learning process. Each LF is a 125M-parameter BERT model trained on a dataset labeled by a regular-expression pattern.
...and 2 more figures
