Tests4Py: A Benchmark for System Testing

Marius Smytzek; Martin Eberlein; Batuhan Serce; Lars Grunske; Andreas Zeller

TL;DR

Tests4Py addresses the lack of functional oracles and input-driven testing in Python benchmarks by providing a modular framework with per-bug oracles, system and unit test generation, and grammars. Derived from BugsInPy, it includes 73 bugs from seven real-world Python projects plus six bugs from example programs, each with an oracle and a harness to drive diverse tests. The benchmark supports evaluating test-generation techniques, mining input grammars, and driving automatic program repair and automated debugging, and its design accommodates both handcrafted and generated tests. Open-source and extensible, Tests4Py aims to support reproducible evaluations and to advance research in testing, debugging, and program repair for Python.

Abstract

Benchmarks are among the main drivers of progress in software engineering research. However, many current benchmarks are limited by inadequate system oracles and sparse unit tests. Our Tests4Py benchmark, derived from the BugsInPy benchmark, addresses these limitations. It includes 73 bugs from seven real-world Python applications and six bugs from example programs. Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation. This allows for comprehensive qualitative studies and extensive evaluations, making Tests4Py well suited as a benchmark for research in test generation, debugging, and automatic program repair.
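To make the interplay of grammar, harness, and oracle concrete, the following is a minimal sketch of the idea and not the Tests4Py API. Every name in it (`middle`, `GRAMMAR`, `generate`, `harness`, `oracle`, `run_campaign`, `Verdict`) is a hypothetical stand-in chosen for illustration: a toy subject with a seeded bug receives system-level inputs produced from a small grammar, and an oracle judges each run for functional correctness.

```python
import random
import re
from enum import Enum


class Verdict(Enum):
    PASSING = "passing"
    FAILING = "failing"


# Hypothetical toy subject with a seeded bug (not taken from the benchmark).
def middle(x, y, z):
    if y < z:
        if x < y:
            return y
        elif x < z:
            return y  # seeded bug: the correct result here is x
    else:
        if x > y:
            return y
        elif x > z:
            return x
    return z


# Assumed input grammar: three space-separated integers of one or two digits.
GRAMMAR = {
    "<start>": ["<int> <int> <int>"],
    "<int>": ["<digit>", "<digit><digit>"],
    "<digit>": list("0123456789"),
}
NONTERMINAL = re.compile(r"<[^<> ]+>")


def generate(rng, symbol="<start>"):
    """Expand the leftmost nonterminal at random until none remain."""
    text = symbol
    while True:
        match = NONTERMINAL.search(text)
        if match is None:
            return text
        expansion = rng.choice(GRAMMAR[match.group(0)])
        text = text[:match.start()] + expansion + text[match.end():]


def harness(system_input):
    """Drive the subject with a system-level input string; return its output."""
    x, y, z = (int(token) for token in system_input.split())
    return str(middle(x, y, z))


def oracle(system_input, output):
    """Judge functional correctness: the output must be the median value."""
    expected = sorted(int(token) for token in system_input.split())[1]
    return Verdict.PASSING if output == str(expected) else Verdict.FAILING


def run_campaign(n, seed=0):
    """Generate n inputs from the grammar and label each run with a verdict."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        system_input = generate(rng)
        results.append((system_input, oracle(system_input, harness(system_input))))
    return results


if __name__ == "__main__":
    for system_input, verdict in run_campaign(10):
        print(f"{system_input!r}: {verdict.value}")
```

In this sketch the oracle recomputes the expected result independently of the subject, so it can label generated inputs that no handwritten test anticipated. An oracle for a real subject would instead encode that subject's failure conditions.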

Paper Structure (16 sections, 4 figures)

Introduction
Tests4Py and its Benchmark
Tests4Py Components
Oracles
Grammars
System Tests
Unit Tests
Usage
Tests4Py Use Cases
Evaluating Test Generation
Mining Input Grammars
Driving Automatic Program Repair
Improving Automated Debugging
Threats to Validity
Related Work
...and 1 more section

Figures (4)

Figure 1: Tests4Py Overview. Tests4Py incorporates components for generating system and unit tests, running them, and assessing their results using generic oracles.
Figure 2: The original unit test for bug #2 as included in the project.
Figure 3: The Tests4Py interface (simplified) provides a harness and API to execute system tests for bug #2. The result of this function is passed directly to the oracle.
Figure 4: The Tests4Py oracle (excerpt and abstracted) for bug #2, used to validate system tests, checks for generic issues. The input itself describes the , , and .
