Evaluating Blocking Biases in Entity Matching

Mohammad Hossein Moslemi, Harini Balamurugan, Mostafa Milani

TL;DR

The paper tackles fairness in the blocking stage of Entity Matching (EM) by extending traditional blocking metrics (RR, PC, and their harmonic mean) to per-group variants and defining disparities to quantify biases. It performs extensive experiments across seven EM benchmarks with multiple blocking methods, showing that blocking bias can propagate to final EM performance even with readily effective reducers. The study reveals that no single blocking method consistently minimizes bias across all datasets, and removing sensitive attributes does not reliably eliminate disparities due to correlated features. The findings underscore the need for debiasing strategies targeted at blocking and the end-to-end EM pipeline, and the authors provide their experimental code for reproducibility.
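The per-group metrics and disparities mentioned above can be illustrated with a small sketch. It uses the standard definitions of reduction ratio (RR, the share of all pairs that blocking prunes), pair completeness (PC, the share of true matches that survive blocking), and their harmonic mean. The group-assignment rule (a pair counts toward a group if either of its records belongs to that group), the disparity as a simple majority-minus-minority difference, and all function names and toy records are assumptions made for illustration, not necessarily the paper's exact formulation.

```python
from itertools import product


def blocking_metrics(candidates, matches, all_pairs):
    """Return (RR, PC, harmonic mean) restricted to the pairs in all_pairs."""
    all_pairs = set(all_pairs)
    cand = set(candidates) & all_pairs   # candidate pairs kept by blocking
    true = set(matches) & all_pairs      # ground-truth matching pairs
    rr = 1 - len(cand) / len(all_pairs) if all_pairs else 0.0
    pc = len(cand & true) / len(true) if true else 0.0
    f = 2 * rr * pc / (rr + pc) if (rr + pc) > 0 else 0.0
    return rr, pc, f


def group_metrics(candidates, matches, all_pairs, in_group):
    """Metrics for one group.

    Assumption: a pair counts toward the group if either record is in it.
    """
    group_pairs = {p for p in all_pairs if in_group(p[0]) or in_group(p[1])}
    return blocking_metrics(candidates, matches, group_pairs)


# Toy data (hypothetical): four records per source, one minority entity.
left = ["a1", "a2", "a3", "a4"]
right = ["b1", "b2", "b3", "b4"]
all_pairs = set(product(left, right))
matches = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
minority = {"a1", "b1"}

# Blocking output that keeps every majority match but misses the minority one.
candidates = {("a2", "b2"), ("a3", "b3"), ("a4", "b4"), ("a1", "b2")}

rr_min, pc_min, f_min = group_metrics(
    candidates, matches, all_pairs, lambda r: r in minority
)
rr_maj, pc_maj, f_maj = group_metrics(
    candidates, matches, all_pairs, lambda r: r not in minority
)

# Disparity taken here as majority minus minority (an illustrative choice).
pc_disparity = pc_maj - pc_min
rr_disparity = rr_maj - rr_min
f_disparity = f_maj - f_min
print(f"PC disparity: {pc_disparity:.3f}")
print(f"RR disparity: {rr_disparity:.3f}")
print(f"F disparity:  {f_disparity:.3f}")
```

In this toy case the overall PC looks acceptable (three of four matches retained), while the minority group's PC is zero. This is the kind of gap that aggregate blocking metrics hide and per-group variants expose.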

Abstract

Entity Matching (EM) is crucial for identifying equivalent data entities across different sources, a task that becomes increasingly challenging with the growth and heterogeneity of data. Blocking techniques, which reduce the computational complexity of EM, play a vital role in making this process scalable. Despite advancements in blocking methods, the issue of fairness, where blocking may inadvertently favor certain demographic groups, has been largely overlooked. This study extends traditional blocking metrics to incorporate fairness, providing a framework for assessing bias in blocking techniques. Through experimental analysis, we evaluate the effectiveness and fairness of various blocking methods, offering insights into their potential biases. Our findings highlight the importance of considering fairness in EM, particularly in the blocking phase, to ensure equitable outcomes in data integration tasks.

Paper Structure

This paper contains 19 sections, 3 equations, 4 figures, and 10 tables.

Figures (4)

  • Figure 1: Disparity in blocking: minority and majority entities are highlighted in blue and red respectively, and the equivalent pairs are linked by dotted lines. Solid lines show the blocks.
  • Figure 2: Runtime of blocking methods
  • Figure 3: Impact of removing sensitive attributes on methods
  • Figure 4: Impact of removing sensitive attributes in datasets

Theorems & Definitions (3)

  • Definition 3.1: Blocking
  • Example 3.2
  • Example 4.1