How to Evaluate Entity Resolution Systems: An Entity-Centric Framework with Application to Inventor Name Disambiguation

Olivier Binette, Youngsoo Baek, Siddharth Engineer, Christina Jones, Abel Dasylva, Jerome P. Reiter

TL;DR

This paper tackles the challenge of evaluating entity resolution (ER) systems by proposing an entity-centric framework that leverages ground-truth, fully resolved clusters to estimate global performance metrics and diagnose error modes. It introduces a data labeling methodology to build reusable benchmark datasets, along with summary statistics for monitoring, and cluster- and record-level error metrics that support comprehensive error analysis. The authors derive representation lemmas showing that common ER performance metrics such as pairwise and B-Cubed precision/recall can be expressed as ratio estimators over cluster-level quantities, enabling efficient estimation from a simple labeled sample and propagation of labeling uncertainty. Through application to PatentsView inventor disambiguation and extensive simulations, they demonstrate that sampling clusters with probability proportional to size yields accurate metric estimates with practical sample sizes, while maintaining a flexible, black-box evaluation workflow that can accommodate additional metrics and application-specific requirements.
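To make the ratio-estimator idea concrete, here is a minimal Python sketch for pairwise recall only. It assumes a sample of fully resolved ground-truth clusters drawn with probability proportional to a known weight (cluster size by default), and it uses a generic weighted ratio form: the weighted sum of correctly linked pairs divided by the weighted sum of true pairs. The function name `estimate_pairwise_recall` and its arguments are hypothetical; this is not the paper's exact lemma, and it is not the API of the er-evaluation package. Precision is not covered, since it also depends on predicted-cluster quantities.

```python
from collections import Counter
from math import comb


def estimate_pairwise_recall(sampled_clusters, predicted, weights=None):
    """Ratio estimate of pairwise recall from sampled ground-truth clusters.

    sampled_clusters: list of lists of record ids, each a fully resolved
        true cluster (hypothetical input format).
    predicted: dict mapping record id -> predicted cluster id.
    weights: values proportional to each cluster's sampling probability.
        Assumption: defaults to cluster size (probability-proportional-to-size).
    """
    numerator = 0.0
    denominator = 0.0
    for i, cluster in enumerate(sampled_clusters):
        p = len(cluster) if weights is None else weights[i]
        # Count how the true cluster is split across predicted clusters.
        overlap = Counter(predicted[record] for record in cluster)
        # True pairs that the prediction also links together.
        linked_pairs = sum(comb(n, 2) for n in overlap.values())
        numerator += linked_pairs / p
        denominator += comb(len(cluster), 2) / p
    if denominator == 0:
        raise ValueError("sample contains no matching pairs (all singletons)")
    return numerator / denominator


# Toy data: two sampled true clusters and a predicted clustering.
sample = [["a", "b", "c"], ["d", "e"]]
predicted = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}
recall_pps = estimate_pairwise_recall(sample, predicted)
recall_uniform = estimate_pairwise_recall(sample, predicted, weights=[1, 1])
```

Dividing each cluster's contribution by its weight offsets the over-representation of large clusters in a size-proportional sample. With equal weights the estimate reduces to the plain ratio of linked pairs to true pairs in the sample.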

Abstract

Entity resolution (record linkage, microclustering) systems are notoriously difficult to evaluate. Looking for a needle in a haystack, traditional evaluation methods use sophisticated, application-specific sampling schemes to find matching pairs of records among an immense number of non-matches. We propose an alternative that facilitates the creation of representative, reusable benchmark data sets without necessitating complex sampling schemes. These benchmark data sets can then be used for model training and a variety of evaluation tasks. Specifically, we propose an entity-centric data labeling methodology that integrates with a unified framework for monitoring summary statistics, estimating key performance metrics such as cluster and pairwise precision and recall, and analyzing root causes for errors. We validate the framework in an application to inventor name disambiguation and through simulation studies. Software: https://github.com/OlivierBinette/er-evaluation/
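For reference, the pairwise precision and recall named above can be computed exactly when both the predicted clustering and the full ground truth are available. The following minimal Python sketch uses the standard pair-counting definitions; the function name `pairwise_precision_recall` and the dictionary input format are hypothetical, and returning 1.0 when there are no pairs to score is a convention assumed here, not something taken from the paper.

```python
from collections import Counter
from math import comb


def pairwise_precision_recall(predicted, truth):
    """Pairwise precision and recall between two clusterings.

    predicted, truth: dicts mapping record id -> cluster id, defined over
        the same set of records (hypothetical input format).
    """
    if predicted.keys() != truth.keys():
        raise ValueError("both clusterings must cover the same records")
    pred_sizes = Counter(predicted.values())
    true_sizes = Counter(truth.values())
    # Size of every (predicted cluster, true cluster) intersection.
    joint_sizes = Counter((predicted[r], truth[r]) for r in predicted)

    pred_pairs = sum(comb(n, 2) for n in pred_sizes.values())
    true_pairs = sum(comb(n, 2) for n in true_sizes.values())
    # Pairs linked in both the prediction and the ground truth.
    both_pairs = sum(comb(n, 2) for n in joint_sizes.values())

    # Assumed convention: a metric with no pairs to score is reported as 1.0.
    precision = both_pairs / pred_pairs if pred_pairs else 1.0
    recall = both_pairs / true_pairs if true_pairs else 1.0
    return precision, recall


# Toy data: five inventor mentions, two true inventors.
truth = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y"}
predicted = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}
precision, recall = pairwise_precision_recall(predicted, truth)
```

In this toy example the prediction links two pairs, (a, b) and (c, d), of which only (a, b) is correct, and the ground truth contains four matching pairs. Counting pairs through cluster sizes avoids enumerating every pair of records, which matters given the immense number of non-matching pairs the abstract describes.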

Paper Structure

This paper contains 32 sections, 5 theorems, 33 equations, 11 figures, and 1 table.

Key Result

Lemma 1

Suppose we sample one cluster $c$ from $\mathcal{C}$ at random. Let $p_c >0$ be proportional to its sampling probability. Then

Figures (11)

  • Figure 1: Diagram representation of the main elements of the framework and their dependencies. The entity resolution system and its predictions live in what we call the "prediction space." Monitoring statistics can be computed for predictions, while their true value for a ground truth clustering can be estimated using labeled data. In the labeled data space, cluster-wise error metrics are obtained from external benchmark datasets or from our data labeling methodology. Error analysis and performance estimates rely on cluster-wise error metrics.
  • Figure 2: Left: Estimated number of citations to PatentsView in academic literature by year. Right: Number of citations by estimated Dewey Decimal Broad Classification. The estimated citation numbers were obtained by searching Google Scholar for mentions of "PatentsView" and "Patents View" and reviewing all results, with the 2023 estimate containing extrapolated counts for November and December. The Dewey Decimal Classification categories were obtained by extracting abstracts from papers and programmatically querying OpenAI's GPT-3.5 model for a classification estimate. Note that GPT-3.5 could not ascertain the classification code for 189 papers.
  • Figure 3: Screenshot of the Streamlit app used for clerical error review. The "inventor_id" field at the top of the page selects a ground truth cluster whose label is derived from the predicted cluster used as a starting point. The table below shows how this ground truth cluster (first horizontal column) relates to predicted clusters it intersects with on the x axis, with each point representing an inventor mention. Observations regarding overclustering and underclustering errors are recorded below.
  • Figure 4: Blue line: Summary statistics for PatentsView's history of predicted disambiguations. Black dotted line: Estimates of the true value of the summary statistics, based on the 2022 inventors' benchmark data, with pointwise $95\%$ confidence intervals.
  • Figure 5: Summary statistics and estimates for the fixed data set of inventor mentions dating up to August 2017. Disambiguations of this fixed data set have changed over time, as changes to the algorithm were made and since information from additional records was used to resolve entities. As before, the dotted line is the estimate of the summary value for the true clustering of the August 2017 inventor mentions. The shaded bands are pointwise $95\%$ confidence intervals. Since the data set is fixed in this case, the estimates are constant over time.
  • ...and 6 more figures

Theorems & Definitions (17)

  • Definition 1: Cluster Size Distribution Statistics
  • Definition 2: Variation and Homonymy Rate Statistics
  • Remark 1
  • Definition 3: Record-Wise Error Metrics
  • Definition 4: Cluster-Wise Error Metrics
  • Remark 2
  • Remark 3
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • ...and 7 more