From Bugs to Benchmarks: A Comprehensive Survey of Software Defect Datasets
Hao-Nan Zhu, Robert M. Furth, Michael Pradel, Cindy Rubio-González
TL;DR
This survey addresses the difficulty of navigating the rapidly growing landscape of software defect datasets by systematically reviewing 151 datasets along four axes: scope, construction, availability and usability, and actual usage. It combines a dual-method literature search, screening, and manual annotation, and uses an open taxonomy to categorize both the datasets and the ways citing publications use them. The work identifies opportunities for broader domain and defect-type coverage, finer-grained defect isolation and annotation, standardized dataset organization, and better reproducibility and evolution practices, and it provides an interactive portal for dataset discovery. The findings underscore the practical value of defect datasets for empirical software engineering, benchmarking, and AI-driven software development, and call for sustainable maintenance to keep the datasets usable and relevant over time.
Abstract
Software defect datasets, which are collections of software bugs and their associated information, are essential resources for researchers and practitioners in software engineering and beyond. Such datasets facilitate empirical research and enable standardized benchmarking for a wide range of techniques, including fault detection, fault localization, test generation, test prioritization, automated program repair, and emerging areas like agentic AI-based software development. Over the years, numerous software defect datasets with diverse characteristics have been developed, providing rich resources for the community, yet making it increasingly difficult to navigate the landscape. To address this challenge, this article provides a comprehensive survey of 151 software defect datasets. The survey discusses the scope of existing datasets, e.g., regarding the application domain of the buggy software, the types of defects, and the programming languages used. We also examine the construction of these datasets, including the data sources and construction methods employed. Furthermore, we assess the availability and usability of the datasets, validating their availability and examining how defects are presented. To better understand the practical uses of these datasets, we analyze the publications that cite them, revealing that the primary use cases are evaluations of new techniques and empirical research. Based on our comprehensive review of the existing datasets, this article suggests potential opportunities for future research, including addressing underrepresented kinds of defects, enhancing availability and usability through better dataset organization, and developing more efficient strategies for dataset construction and maintenance. All surveyed datasets and their classifications are available at https://defect-datasets.github.io/.
