Reducing Biases in Record Matching Through Scores Calibration

Mohammad Hossein Moslemi, Mostafa Milani

TL;DR

A threshold-independent notion of score bias is introduced that extends standard group-fairness criteria (demographic parity, equal opportunity, and equalized odds) from binary outputs to score functions by integrating group-wise metric gaps over all thresholds.
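To make the idea of integrating a metric gap over all thresholds concrete, here is a minimal sketch for the demographic-parity case. It assumes scores lie in [0, 1], that the gap at a threshold is the absolute difference in positive-prediction rates between two groups, and that the integral is approximated on a uniform grid. The function name `dp_score_bias` and the grid size are hypothetical, and the paper's exact definition (norm, normalization, group weighting) may differ.

```python
import numpy as np

def dp_score_bias(scores_a, scores_b, num_thresholds=1001):
    """Approximate the integral over thresholds in [0, 1] of the absolute
    gap in positive-prediction rates between two groups (assumed form)."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)

    # Positive rate at threshold t: fraction of pairs with score >= t.
    rate_a = (scores_a[None, :] >= thresholds[:, None]).mean(axis=1)
    rate_b = (scores_b[None, :] >= thresholds[:, None]).mean(axis=1)

    # Trapezoidal rule over the threshold grid.
    gap = np.abs(rate_a - rate_b)
    return float(np.sum((gap[1:] + gap[:-1]) / 2.0 * np.diff(thresholds)))

# Two groups whose scores sit at 0.2 and 0.8: the rates differ for every
# threshold between the two values, so the bias is close to 0.6.
print(dp_score_bias([0.2] * 4, [0.8] * 4))
```

A single-threshold check at, say, 0.9 would report no gap in this toy case, while the integrated quantity does not depend on that choice.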

Abstract

Record matching models typically output a real-valued matching score that is later consumed through thresholding, ranking, or human review. While fairness in record matching has mostly been assessed using binary decisions at a fixed threshold, such evaluations can miss systematic disparities in the entire score distribution and can yield conclusions that change with the chosen threshold. We introduce a threshold-independent notion of score bias that extends standard group-fairness criteria, namely demographic parity (DP), equal opportunity (EO), and equalized odds (EOD), from binary outputs to score functions by integrating group-wise metric gaps over all thresholds. Using this metric, we empirically show that several state-of-the-art deep matchers can exhibit substantial score bias even when appearing fair at commonly used thresholds. To mitigate these disparities without retraining the underlying matcher, we propose two model-agnostic post-processing methods that only require score evaluations on an (unlabeled) calibration set. Calib targets DP by aligning minority/majority score distributions to a common Wasserstein barycenter via a quantile-based optimal-transport map, with finite-sample guarantees on both residual DP bias and score distortion. C-Calib extends this idea to label-dependent notions (EO/EOD) by performing barycenter alignment conditionally on an estimated label, and we characterize how its guarantees depend on both sample size and label-estimation error. Experiments on standard record-matching benchmarks and multiple neural matchers confirm that Calib and C-Calib substantially reduce score bias with minimal loss in accuracy.

Paper Structure

This paper contains 28 sections, 2 theorems, 25 equations, 6 figures, 5 tables, and 2 algorithms.

Key Result

Theorem 1

Let $\hat{s}$ be the calibrated score produced by Algorithm \ref{alg:calibrate} and let $s^*$ denote the population barycenter score defined above. Under Assumptions 1--2, where $n=\min(n_a,n_b)$.

Figures (6)

  • Figure 1: Figure \ref{fig:tpr} shows variation in TPR across thresholds. The ROC curve of HierMatch \cite{fu2021hierarchical} (Figure \ref{fig:auc-two}) shows that the AUC is nearly the same for both groups: 93.33% for the minority group and 93.94% for the majority group. At specific thresholds, however, performance differs noticeably.
  • Figure 2: Running Algorithm \ref{alg:calibrate} in Example \ref{ex:calibrate}
  • Figure 3: Running Algorithm \ref{alg:pcalibrate} in Example \ref{ex:pcalibrate}
  • Figure 4: Variation of bias across thresholds, highlighting the limits of single-threshold fairness assessments. Figure \ref{fig:DPcompare} shows this for the DBLP-ACM dataset with HierMatch, and Figure \ref{fig:EOcompare} for the AMZ-GOO dataset with DeepMatch.
  • Figure 5: Comparison of distributional demographic disparity across different models and datasets before and after Calib.
  • ...and 1 more figure

Theorems & Definitions (9)

  • Example 1
  • Definition 1: Fair matching score
  • Definition 2: FairScore
  • Example 2
  • Theorem 1
  • Proof
  • Example 3
  • Theorem 2
  • Proof sketch