Tests4Py: A Benchmark for System Testing

Marius Smytzek, Martin Eberlein, Batuhan Serce, Lars Grunske, Andreas Zeller

TL;DR

Tests4Py addresses the lack of functional oracles and input-driven testing in Python benchmarks by providing a modular framework with per-bug oracles, system and unit test generation, and grammars. Derived from BugsInPy, it includes 73 bugs from seven real-world Python projects and six bugs from example programs, each with an oracle and a harness to drive diverse tests. The benchmark supports evaluating test-generation techniques, mining input grammars, and driving automatic program repair and automated debugging, and it accommodates both handcrafted and generated tests. Open-source and extensible, Tests4Py aims to support reproducible evaluations and advance research in testing, debugging, and program repair for Python.

Abstract

Benchmarks are among the main drivers of progress in software engineering research. However, many current benchmarks are limited by inadequate system oracles and sparse unit tests. Our Tests4Py benchmark, derived from the BugsInPy benchmark, addresses these limitations. It includes 73 bugs from seven real-world Python applications and six bugs from example programs. Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation. This allows for comprehensive qualitative studies and extensive evaluations, making Tests4Py a cutting-edge benchmark for research in test generation, debugging, and automatic program repair.

Paper Structure (16 sections, 4 figures)

Figures (4)

  • Figure 1: Tests4Py Overview. Tests4Py incorporates components for generating system and unit tests, running them, and assessing their results using generic oracles.
  • Figure 2: The original unit test for bug #2, as included in the project.
  • Figure 3: The Tests4Py interface (simplified) provides a harness and API to execute system tests for bug #2. The result of this function is passed directly to the oracle.
  • Figure 4: The Tests4Py oracle (excerpt and abstracted) for bug #2, used to validate system tests, checks for generic issues. The input itself describes the , , and .
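
The harness-then-oracle flow described in the captions for Figures 3 and 4 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the subject program, the names `harness`, `oracle`, `run_system_test`, and the three-valued `Verdict` are hypothetical and are not taken from the Tests4Py API or from the paper's bug #2.

```python
import subprocess
import sys
from enum import Enum


class Verdict(Enum):
    # Hypothetical verdicts; not the benchmark's actual result type.
    PASSING = "passing"
    FAILING = "failing"
    UNDEFINED = "undefined"


# Hypothetical subject with a functional bug: it should print |a - b|
# but prints a - b, so the output is wrong whenever a < b.
SUBJECT = "import sys; a, b = map(int, sys.argv[1:3]); print(a - b)"


def harness(args):
    """Run the subject as a separate process on one system-level input."""
    return subprocess.run(
        [sys.executable, "-c", SUBJECT, *args],
        capture_output=True,
        text=True,
        timeout=10,
    )


def oracle(args, result):
    """Judge one run against the expected functional behavior."""
    try:
        a, b = (int(x) for x in args)
    except ValueError:
        # Input lies outside the domain the oracle can judge.
        return Verdict.UNDEFINED
    if result.returncode != 0:
        # A crash on a valid input counts as a failure.
        return Verdict.FAILING
    expected = str(abs(a - b))
    if result.stdout.strip() == expected:
        return Verdict.PASSING
    return Verdict.FAILING


def run_system_test(args):
    """The harness output is handed directly to the oracle."""
    return oracle(args, harness(args))
```

In this sketch the oracle checks functional correctness from the input alone, so it can judge generated inputs as well as handcrafted ones; a test such as `run_system_test(["3", "5"])` exposes the bug even though the program exits cleanly.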