BUGSPHP: A dataset for Automated Program Repair in PHP
K. D. Pramod, W. T. N. De Silva, W. U. K. Thabrew, Ridwan Shariffdeen, Sandareka Wickramanayake
TL;DR
PHP has long been a dominant server-side language, yet it lacks a standardized bug benchmark for automated program repair (APR). BUGSPHP is a large-scale PHP bug dataset with 653,606 training bug-fix commits gathered from 4,483 repositories and 513 manually validated test bugs gathered from 15 repositories, validated through developer tests and dynamic analysis. The paper details the data collection and validation pipeline, analyzes patch characteristics and bug types, and provides a preliminary evaluation of two learning-based APR models on the PHP dataset. The findings show only partial success for existing APR approaches on PHP, underscoring the need for PHP-specific repair techniques and richer test coverage. BUGSPHP is publicly available to support research in PHP program repair, testing, and analysis.
Abstract
Automated Program Repair (APR) improves developer productivity by saving debugging and bug-fixing time. While APR has been extensively explored for C/C++ and Java programs, there is little research on bugs in PHP programs, owing to the lack of a benchmark PHP bug dataset. This is surprising given that PHP has been one of the most widely used server-side languages for over two decades, used in contexts such as e-commerce, social networking, and content management. This paper presents BUGSPHP, a benchmark dataset of PHP bugs from real-world applications, which can enable research on analysis, testing, and repair for PHP programs. The dataset consists of a training dataset and a test dataset, each curated separately from GitHub and processed locally. The training dataset includes more than 600,000 bug-fixing commits. The test dataset contains 513 manually validated bug-fixing commits, each equipped with developer-provided test cases to assess patch correctness.
