Tests4Py: A Benchmark for System Testing
Marius Smytzek, Martin Eberlein, Batuhan Serce, Lars Grunske, Andreas Zeller
TL;DR
Tests4Py addresses the lack of functional oracles and input-driven testing in Python benchmarks by providing a modular framework with per-bug oracles, system and unit test generation, and grammars. Derived from BugsInPy, it includes 73 bugs from seven real-world Python projects and six bugs from example programs, each with an oracle and a harness for running diverse tests. The benchmark supports evaluating test-generation techniques, mining input grammars, and driving automatic program repair and automated debugging, and it accommodates both handcrafted and generated tests. Open-source and extensible, Tests4Py aims to support reproducible evaluations and advance research in testing, debugging, and program repair for Python.
Abstract
Benchmarks are among the main drivers of progress in software engineering research. However, many current benchmarks are limited by inadequate system oracles and sparse unit tests. Our Tests4Py benchmark, derived from the BugsInPy benchmark, addresses these limitations. It includes 73 bugs from seven real-world Python applications and six bugs from example programs. Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation. This allows for comprehensive qualitative studies and extensive evaluations, making Tests4Py well suited to research in test generation, debugging, and automatic program repair.
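To illustrate what a per-subject functional oracle paired with a system-test harness could look like, here is a minimal sketch. It does not use the Tests4Py API: the toy subject `buggy_middle`, the `Verdict` enum, and the functions `run_system_test` and `oracle` are all hypothetical names chosen for illustration only.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    PASSING = "passing"
    FAILING = "failing"


def buggy_middle(x: int, y: int, z: int) -> int:
    """Hypothetical toy subject: should return the median of three ints."""
    if y < z:
        if x < y:
            return y
        elif x < z:
            return y  # seeded bug: the median here is x, not y
    else:
        if x > y:
            return y
        elif x > z:
            return x
    return z


def run_system_test(program: Callable[[int, int, int], int], raw_input: str) -> str:
    """Harness: parse a system-level input line, run the subject, return its output as text."""
    x, y, z = (int(token) for token in raw_input.split())
    return str(program(x, y, z))


def oracle(raw_input: str, output: str) -> Verdict:
    """Functional oracle: judge the output against the expected median for this input."""
    expected = sorted(int(token) for token in raw_input.split())[1]
    return Verdict.PASSING if output == str(expected) else Verdict.FAILING


# Inputs may be handcrafted or produced by a test generator; the oracle labels each one.
inputs = ["1 2 3", "2 1 3", "3 2 1"]
verdicts = [oracle(i, run_system_test(buggy_middle, i)) for i in inputs]
```

Because the oracle judges any input rather than a fixed set of test cases, the same check applies equally to handcrafted and generated tests, which is the property the abstract emphasizes.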
