Towards Scalable Generation of Realistic Test Data for Duplicate Detection
Fabian Panse, Wolfram Wingerath, Benjamin Wollmer
TL;DR
The paper addresses the lack of scalable, realistic test data for evaluating duplicate detection over large and diverse data sources. It presents DaPo$^+$, a six-phase data-generation pipeline that profiles the input data, automatically configures the generation parameters, creates data histories, and emits multi-source datasets with complex schemas and temporal errors using an event-based model. The key contributions are support for non-relational data, automatic high-level configuration, and the use of data histories and copying processes to simulate outdated values and dependencies between sources. The approach aims to make benchmarks for duplicate detection more realistic across data cleaning, integration, and linkage tasks, in particular when evaluating algorithms designed to scale.
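To make the idea of data histories and copying processes concrete, the following minimal Python sketch replays a list of timestamped change events to obtain an entity's state at a given time. It is only an illustration under stated assumptions: the names `Event`, `state_at`, and the sample values are hypothetical and not taken from DaPo$^+$, whose actual event model is not specified here. A source that took its snapshot early holds outdated values, and a second source that copies from it inherits them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One change in an entity's history (hypothetical structure)."""
    time: int   # logical timestamp of the change
    attr: str   # attribute that changed
    value: str  # new value of that attribute


def state_at(history, time):
    """Replay all events up to and including `time` and return the state."""
    state = {}
    for ev in sorted(history, key=lambda e: e.time):
        if ev.time <= time:
            state[ev.attr] = ev.value
    return state


# Invented history of a single real-world entity.
history = [
    Event(0, "name", "Anna Meyer"),
    Event(0, "city", "Hamburg"),
    Event(5, "city", "Berlin"),
    Event(9, "name", "Anna Schmidt"),
]

source_a = state_at(history, 3)  # snapshot taken early: later changes are missing
source_b = dict(source_a)        # copied from source A: inherits the outdated values
source_c = state_at(history, 9)  # snapshot taken late: reflects the current state
```

In this sketch, `source_a` and `source_b` agree with each other but differ from `source_c` in both attributes, so the three records are duplicates of the same entity whose differences stem from time and copying rather than from typos.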
Abstract
Due to the increasing volume, volatility, and diversity of data in virtually all areas of our lives, the ability to detect duplicates in potentially linked data sources is more important than ever. However, while research is already intensively engaged in adapting duplicate detection algorithms to these changing circumstances, existing test data generators are still designed for small (mostly relational) datasets and can thus fulfill their intended task only to a limited extent. In this report, we present our ongoing research on a novel approach to test data generation that, in contrast to existing solutions, is able to produce large test datasets with complex schemas and more realistic error patterns while remaining easy to use for inexperienced users.
