Progressive Entity Resolution: A Design Space Exploration

Jakub Maciejewski, Konstantinos Nikoletos, George Papadakis, Yannis Velegrakis

TL;DR

The paper tackles time- and resource-constrained entity resolution by introducing a unified Progressive ER design space built on four steps: Filtering, Weighting, Scheduling, and Matching. It shows how diverse filtering techniques (NN, Join, Blocking, Sorting) can be integrated with corresponding weighting schemes and scheduling algorithms to produce high-quality results under a defined budget $N$, measured by progressive recall $PR@N$. Through grid-search experiments on 18 real-world datasets (10 Record Linkage, 8 Deduplication), the authors demonstrate the effectiveness and efficiency of representative configurations and compare them against state-of-the-art baselines such as DeepBlocker, Sparkly, and I-PES. The findings reveal that carefully chosen combinations (e.g., NN with BFS for noisy data, blocking with CN-CBS, and sorting-based EC-ID-10-Global) deliver strong progressive recall and favorable runtime and memory footprints, offering practical guidance for cloud or pay-as-you-go ER deployments.

Abstract

Entity Resolution (ER) is typically implemented as a batch task that processes all available data before identifying duplicate records. However, applications with time or computational constraints, e.g., those running in the cloud, require a progressive approach that produces results in a pay-as-you-go fashion. Numerous algorithms have been proposed for Progressive ER in the literature. In this work, we propose a novel framework for Progressive Entity Resolution that organizes relevant techniques into four consecutive steps: (i) filtering, which reduces the search space to the most likely candidate matches, (ii) weighting, which associates every pair of candidate matches with a similarity score, (iii) scheduling, which prioritizes the execution of the candidate matches so that the real duplicates precede the non-matching pairs, and (iv) matching, which applies a complex matching function to the pairs in the order defined by the previous step. We associate each step with existing and novel techniques, illustrating that our framework overall generates a superset of the main existing works in the field. We select the most representative combinations resulting from our framework and fine-tune them over 10 established datasets for Record Linkage and 8 for Deduplication, with our results indicating that our taxonomy yields a wide range of high-performing progressive techniques in terms of both effectiveness and time efficiency.
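
The four steps named in the abstract can be illustrated with a toy Record Linkage sketch. This is not the authors' implementation: the token-overlap filter, the Jaccard weight, the threshold-based matcher, the sample records, and all function names are hypothetical stand-ins chosen for brevity. The helper `recall_within_budget` assumes that recall under a budget is simply the fraction of true duplicates found among the first $N$ scheduled pairs; the paper's exact definition of $PR@N$ may differ.

```python
from itertools import product

# Hypothetical toy sources; the indices in `truth` pair records across them.
left = ["apple iphone 12 black", "samsung galaxy s21", "sony wh-1000xm4 headphones"]
right = ["iphone 12 apple", "galaxy s21 samsung phone", "bose headphones"]
truth = {(0, 0), (1, 1)}


def tokens(text):
    """Lower-cased whitespace tokens of a record."""
    return set(text.lower().split())


def filter_pairs(left, right):
    """Filtering (toy): keep cross-source pairs that share at least one token."""
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(left), enumerate(right))
        if tokens(a) & tokens(b)
    ]


def weight(left, right, pair):
    """Weighting (toy): Jaccard similarity of the two token sets."""
    a, b = tokens(left[pair[0]]), tokens(right[pair[1]])
    return len(a & b) / len(a | b)


def schedule(left, right, pairs):
    """Scheduling (toy): order candidates by decreasing weight."""
    return sorted(pairs, key=lambda p: weight(left, right, p), reverse=True)


def match(left, right, pair, threshold=0.5):
    """Matching (toy): stand-in for an expensive matching function."""
    return weight(left, right, pair) >= threshold


def progressive_er(left, right, budget):
    """Run the four steps, spending at most `budget` matcher calls."""
    ranked = schedule(left, right, filter_pairs(left, right))
    return [p for p in ranked[:budget] if match(left, right, p)]


def recall_within_budget(found, ground_truth):
    """Assumed metric: share of true duplicates found within the budget."""
    return len(set(found) & ground_truth) / len(ground_truth)


for n in (1, 2, 3):
    found = progressive_er(left, right, budget=n)
    print(n, found, recall_within_budget(found, truth))
```

Because scheduling places the likeliest pairs first, most of the recall in this toy example is reached before the budget is exhausted, which is the behavior a progressive approach aims for.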

Paper Structure

This paper contains 20 sections, 15 figures, 3 tables, 3 algorithms.

Figures (15)

  • Figure 1: Two clean data sources and the candidate pairs generated by a Progressive Entity Resolution approach.
  • Figure 2: Batch vs Progressive Entity Resolution.
  • Figure 3: Progressive Entity Resolution workflow.
  • Figure 4: Progressive recall and recall of the best NN workflows in Table \ref{tb:nnConf}(a) across all budgets over selected datasets.
  • Figure 5: Progressive recall and recall of the best join workflows in Table \ref{tb:nnConf}(b) across all budgets over selected datasets.
  • ...and 10 more figures