pyMethods2Test: A Dataset of Python Tests Mapped to Focal Methods
Idriss Abdelmadjid, Robert Dyer
TL;DR
This paper introduces pyMethods2Test, the first large-scale Python dataset that maps unit tests to their focal methods, filling a gap left by Java-centered datasets. By mining 88,846 Python projects and applying AST-based analysis, the authors identify over 22 million test methods and about 2.2 million focal methods. The dataset provides explicit test-to-method traceability in a structured JSON format, along with focal-context data to facilitate LLM-based test generation. The pipeline targets projects that use the Pytest and unittest frameworks, applies heuristics to locate focal files, classes, and methods, and stores a commit hash for each repository to preserve context. The dataset supports applications ranging from research on testing practices and design patterns to test-automation tooling and educational use, and it is publicly available on Zenodo alongside scripts for generating focal context.
Abstract
Python is one of the fastest-growing programming languages and currently tops many language rankings, having recently overtaken JavaScript as the most-used language on GitHub. Given its importance in data science and machine learning, it is imperative to be able to train LLMs effectively to generate good unit test cases for Python code. This motivates the need for a large dataset that provides training and testing data. To date, while large datasets exist for languages such as Java, none is publicly available for Python. Python poses difficult challenges for building such a dataset because of its less rigid naming requirements. In this work, we consider two commonly used Python unit testing frameworks: Pytest and unittest. We analyze a large corpus of over 88K open-source GitHub projects that use these frameworks. Using a carefully designed set of heuristics, we locate over 22 million test methods. We then analyze the test and non-test code and map individual unit tests to the focal method being tested. This provides an explicit traceability link from each test to the method it tests. Our pyMethods2Test dataset contains over 2 million of these focal method mappings, as well as the ability to generate useful context for input to LLMs. The pyMethods2Test dataset is publicly available on Zenodo at: https://doi.org/10.5281/zenodo.14264518
