How to Evaluate Entity Resolution Systems: An Entity-Centric Framework with Application to Inventor Name Disambiguation
Olivier Binette, Youngsoo Baek, Siddharth Engineer, Christina Jones, Abel Dasylva, Jerome P. Reiter
TL;DR
This paper tackles the challenge of evaluating entity resolution (ER) systems by proposing an entity-centric framework that uses ground-truth, fully resolved clusters to estimate global performance metrics and diagnose error modes. It introduces a data labeling methodology for building reusable benchmark datasets, summary statistics for monitoring, and cluster- and record-level error metrics that support error analysis. The authors derive representation lemmas showing that common ER performance metrics, such as pairwise and B-Cubed precision and recall, can be expressed as ratios of cluster-level quantities. This enables efficient estimation from a simple labeled sample of clusters and allows labeling uncertainty to be propagated to the estimates. Through an application to PatentsView inventor disambiguation and extensive simulations, they demonstrate that sampling clusters with probability proportional to size yields accurate metric estimates at practical sample sizes. The evaluation workflow treats the ER system as a black box and can accommodate additional metrics and application-specific requirements.
Abstract
Entity resolution (record linkage, microclustering) systems are notoriously difficult to evaluate. Looking for a needle in a haystack, traditional evaluation methods use sophisticated, application-specific sampling schemes to find matching pairs of records among an immense number of non-matches. We propose an alternative that facilitates the creation of representative, reusable benchmark data sets without necessitating complex sampling schemes. These benchmark data sets can then be used for model training and a variety of evaluation tasks. Specifically, we propose an entity-centric data labeling methodology that integrates with a unified framework for monitoring summary statistics, estimating key performance metrics such as cluster and pairwise precision and recall, and analyzing root causes for errors. We validate the framework in an application to inventor name disambiguation and through simulation studies. Software: https://github.com/OlivierBinette/er-evaluation/
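To make the cluster-level view of pairwise precision and recall concrete, the following is a minimal sketch of how these metrics can be written as ratios of sums over ground-truth clusters. It is an illustration under stated assumptions and not the authors' implementation or the API of the er-evaluation package: the function name `pairwise_metrics`, its arguments, and the toy data are all hypothetical. Each predicted pair is split evenly between its two records, so that summing over the records of every true cluster recovers the total number of predicted pairs.

```python
from collections import Counter
from math import comb


def pairwise_metrics(true_clusters, predicted, weights=None):
    """Pairwise precision and recall as ratios of cluster-level sums.

    true_clusters: list of sets of record ids (ground-truth entities).
    predicted: dict mapping every record id to a predicted cluster label.
    weights: optional per-cluster weights (hypothetical hook, e.g. inverse
        sampling probabilities when only a sample of clusters is labeled).
    """
    pred_size = Counter(predicted.values())  # size of each predicted cluster
    if weights is None:
        weights = [1.0] * len(true_clusters)

    num = den_precision = den_recall = 0.0
    for cluster, w in zip(true_clusters, weights):
        # How the records of this true cluster are spread over predicted clusters.
        overlap = Counter(predicted[r] for r in cluster)
        # Correctly linked pairs: pairs that share both true and predicted cluster.
        num += w * sum(comb(n, 2) for n in overlap.values())
        # Predicted pairs attributed to this cluster: each record carries
        # half of every predicted pair it belongs to.
        den_precision += w * sum(pred_size[predicted[r]] - 1 for r in cluster) / 2
        # True pairs within this cluster.
        den_recall += w * comb(len(cluster), 2)

    precision = num / den_precision if den_precision else 1.0
    recall = num / den_recall if den_recall else 1.0
    return precision, recall


# Toy data (made up for illustration): three true entities, three predicted clusters.
true_clusters = [{"a", "b", "c"}, {"d", "e"}, {"f"}]
predicted = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3, "f": 3}

precision, recall = pairwise_metrics(true_clusters, predicted)
print(precision, recall)
```

With all clusters labeled and unit weights, the ratios equal the usual pairwise precision and recall. When only a sample of clusters is labeled, one option is to pass inverse-probability weights, for example `1 / len(cluster)` for clusters drawn with probability proportional to size, which turns the same function into a ratio estimator of the population metrics.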
