PerfCurator: Curating a large-scale dataset of performance bug-related commits from public repositories
Md Abul Kalam Azad, Manoj Alexender, Matthew Alexender, Syed Salauddin Mohammad Tariq, Foyzul Hassan, Probir Roy
TL;DR
PerfCurator tackles the lack of large-scale, open datasets of performance bugs by combining a lightweight, distillation-based classifier with scalable repository mining. The authors develop PcBERT-HS and PcBERT-KD to identify performance bug-related commits, achieving state-of-the-art accuracy at far lower computational cost than 7-billion-parameter LLMs. They deploy PerfCurator on a 50-node CPU cluster to mine GitHub repositories in Python, C++, and Java, producing a dataset of over 408K performance bug-fix commits. The study demonstrates that the larger dataset improves data-driven performance bug detection and API-misuse detection, offering practical benefits for performance engineering and tooling. Overall, the work provides a scalable workflow and dataset to advance research on performance bugs and their mitigation.
Abstract
Performance bugs challenge software development, degrading performance and wasting computational resources. Software developers invest substantial effort in addressing these issues. Curating these performance bugs can offer valuable insights to the software engineering research community, aiding the development of new mitigation strategies. However, no large-scale, open-source dataset of performance bugs is currently available. To bridge this gap, we propose PerfCurator, a repository miner that collects performance bug-related commits at scale. PerfCurator employs PcBERT-KD, a 125M-parameter BERT model trained to classify performance bug-related commits. Our evaluation shows that PcBERT-KD achieves accuracy comparable to 7-billion-parameter LLMs with significantly lower computational overhead, enabling cost-effective deployment on CPU clusters. Using PcBERT-KD as the core component, we deployed PerfCurator on a 50-node CPU cluster to mine GitHub repositories. This mining operation produced a large-scale dataset comprising 114K performance bug-fix commits in Python, 217.9K in C++, and 76.6K in Java. Our results demonstrate that this large-scale dataset significantly enhances the effectiveness of data-driven performance bug detection systems.
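The abstract does not spell out how PcBERT-KD is trained. The name and the "distillation-based" description suggest knowledge distillation from a larger teacher model. As a rough illustration only, the sketch below shows the standard soft-target distillation objective (a temperature-scaled KL term plus a hard-label cross-entropy term) for a two-class commit classifier. The function names, the temperature, and the weighting `alpha` are assumptions made for this example, not values taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic soft-target distillation loss for one example.

    Hypothetical helper: temperature and alpha are illustrative defaults,
    not hyperparameters reported for PcBERT-KD.
    """
    # Soft-target term: KL(teacher || student) at the given temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(p_teacher, p_student) if t > 0.0)
    # Hard-label term: cross-entropy of the student against the true label.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce

# Example: label 1 = "performance bug-related commit" (assumed encoding).
loss = distillation_loss(student_logits=[0.2, 0.9],
                         teacher_logits=[-1.0, 2.5],
                         label=1)
print(f"distillation loss: {loss:.4f}")
```

When the student's logits match the teacher's, the KL term vanishes and only the weighted cross-entropy remains, so the loss rewards the small model both for agreeing with the teacher and for predicting the true label.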
