Towards Scalable Generation of Realistic Test Data for Duplicate Detection

Fabian Panse; Wolfram Wingerath; Benjamin Wollmer

TL;DR

The paper tackles the need for scalable, realistic test data for evaluating duplicate detection in the era of big and diverse data sources. It presents DaPo$^+$, a six-phase data-generation pipeline that profiles inputs, automatically configures generation parameters, creates data histories, and emits multi-source datasets with complex schemas and temporal errors via an event-based model. Key contributions include extending support to non-relational data, enabling automatic high-level configuration, and incorporating data histories and copying processes to simulate outdated values and inter-source dependencies. This approach promises more realistic benchmarking for duplicate detection across data cleaning, integration, and linkage tasks, with practical impact for evaluating scalable algorithms in real-world settings.

Abstract

Due to the increasing volume, volatility, and diversity of data in virtually all areas of our lives, the ability to detect duplicates in potentially linked data sources is more important than ever before. However, while research is already intensively engaged in adapting duplicate detection algorithms to the changing circumstances, existing test data generators are still designed for small, mostly relational datasets and can thus fulfill their intended task only to a limited extent. In this report, we present our ongoing research on a novel approach for test data generation that, in contrast to existing solutions, is able to produce large test datasets with complex schemas and more realistic error patterns while being easy to use for inexperienced users.

Paper Structure (10 sections, 3 figures)

This paper contains 10 sections, 3 figures.

Introduction
Application Contexts
State of the Art & Related Work
Test Data Generation in Six Phases
Challenges & Ongoing Research
Data Profiling
Automatic Preconfiguration
Generation and Reuse of Data Histories
Source Creation & Pollution
Conclusion

Figures (3)

Figure 1: Basic architecture of DaPo$^+$
Figure 2: Sample hierarchy of error parameters. Automated mappings from the few abstract high-level parameters to the many low-level parameters support inexperienced users in configuring the actual error probabilities.
Figure 3: Event-based error model with three data sources. Events are inserts, updates, and deletes of records, copying processes, and changes in the schema and error profiles.
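
The event-based model in the Figure 3 caption can be illustrated with a minimal sketch. It is not the DaPo$^+$ implementation: the `Event` class, the `replay` function, and the sample addresses are hypothetical names and values chosen for illustration, and schema and error-profile changes are left out. The sketch only shows how replaying inserts, updates, deletes, and copying processes in time order can leave one source holding an outdated value copied from another.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """One entry in a data history (hypothetical structure, not from the paper)."""
    time: int
    kind: str                          # "insert", "update", "delete", or "copy"
    source: str                        # source the event is applied to
    record_id: str
    value: Optional[str] = None        # new value for insert/update
    copied_from: Optional[str] = None  # origin source for copy events


def replay(events: List[Event], source_names: List[str]) -> Dict[str, Dict[str, str]]:
    """Apply events in time order and return the final state of every source."""
    sources: Dict[str, Dict[str, str]] = {name: {} for name in source_names}
    for ev in sorted(events, key=lambda e: e.time):
        records = sources[ev.source]
        if ev.kind in ("insert", "update"):
            records[ev.record_id] = ev.value
        elif ev.kind == "delete":
            records.pop(ev.record_id, None)
        elif ev.kind == "copy":
            # The copy takes whatever the origin holds at this point in time.
            origin = sources[ev.copied_from]
            if ev.record_id in origin:
                records[ev.record_id] = origin[ev.record_id]
        else:
            raise ValueError(f"unknown event kind: {ev.kind}")
    return sources


# Source B copies record r1 from source A before A updates it,
# so B ends up with an outdated value. Source C never receives the record.
events = [
    Event(1, "insert", "A", "r1", "12 Main St"),
    Event(2, "copy", "B", "r1", copied_from="A"),
    Event(3, "update", "A", "r1", "7 Oak Ave"),
]
state = replay(events, ["A", "B", "C"])
```

After the replay, source A holds the current value for `r1` while source B still holds the value it copied earlier. This is the kind of temporal error and inter-source dependency the TL;DR describes.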
