Efficient Model Repository for Entity Resolution: Construction, Search, and Integration

Victor Christen, Peter Christen

TL;DR

MoRER is the first method for building a model repository for ER problems, facilitating the continuous integration of new data sources by reducing the need for generating new training data.

Abstract

Entity resolution (ER) is a fundamental task in data integration that enables insights from heterogeneous data sources. The primary challenge of ER lies in classifying record pairs as matches or non-matches, which in multi-source ER (MS-ER) scenarios is complicated by data source heterogeneity and scalability issues. Existing methods for MS-ER generally require labeled record pairs and fail to effectively reuse models across multiple ER tasks. We propose MoRER (Model Repositories for Entity Resolution), a novel method for building a model repository consisting of classification models that solve ER problems. By leveraging feature distribution analysis, MoRER clusters similar ER tasks, thereby enabling the effective initialization of a model repository with a moderate labeling effort. Experimental results on three multi-source datasets demonstrate that MoRER achieves results comparable to or better than those of methods with limited labeling budgets, such as active learning and transfer learning approaches, while outperforming self-supervised approaches that utilize large pre-trained language models. When compared to supervised transformer-based methods, MoRER achieves comparable or better results, depending on the training data size. Importantly, MoRER is the first method for building a model repository for ER problems, facilitating the continuous integration of new data sources by reducing the need for generating new training data.

Paper Structure

This paper contains 18 sections, 14 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Motivation for reusing solved ER tasks for new tasks. The data sources $D_1$ and $D_2$ are already linked using similarity feature vectors and a model $M_{1,2}$ that labels each record pair. The question is whether the derived model $M_{1,2}$ can also be applied to the new data source, $D_3$, to match it to $D_1$ and $D_2$, or whether new models have to be generated.
  • Figure 2: Example of the similarity distributions using the Jaccard similarity on the title attribute for the ER problems in the WDC-computer data set. Each of the five differently colored lines represents an ER problem.
  • Figure 3: The workflow for initializing and using an ER model repository consists of five steps: 1. Similarity Distribution Analysis, 2. ER Problem Clustering, 3. Model Generation, 4. Process new ER problems, 5. Classification.
  • Figure 4: Example of integrating a new ER problem $p_{3,5}$. The grey-colored ER problems represent problems of $T$.
  • Figure 5: Linkage quality comparison of MoRER to Almser standalone, Sudowoodo, AnyMatch, TransER, and Ditto.
  • ...and 3 more figures
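
The similarity distributions described in the Figure 2 caption can be illustrated with a small sketch. It computes token-level Jaccard similarities on a title attribute for two toy ER problems and compares the resulting distributions. The record pairs, the function names `jaccard` and `ks_distance`, and the choice of the two-sample Kolmogorov-Smirnov statistic as the comparison measure are all assumptions made for illustration; the source does not state here which distribution test MoRER uses.

```python
from bisect import bisect_right


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity of two strings (whitespace tokens)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def ks_distance(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of xs and ys (0 = identical, 1 = fully separated)."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in points
    )


# Hypothetical record pairs (title of record 1, title of record 2)
# for two ER problems; these are illustrative, not from the data sets.
problem_a = [
    ("dell xps 13 laptop", "dell xps 13 notebook"),
    ("hp pavilion 15", "lenovo thinkpad x1"),
    ("asus zenbook 14", "asus zenbook 14 oled"),
]
problem_b = [
    ("dell xps 13 laptop", "dell xps 13 laptop"),
    ("acer aspire 5", "apple macbook air"),
    ("msi stealth 15", "msi stealth 15 gaming"),
]

sims_a = [jaccard(left, right) for left, right in problem_a]
sims_b = [jaccard(left, right) for left, right in problem_b]

# A small distance suggests the two ER problems have similar feature
# distributions and might share a classification model.
distance = ks_distance(sims_a, sims_b)
print(f"KS distance between problems: {distance:.3f}")
```

Under this assumption, ER problems whose pairwise distances fall below some threshold would be candidates for the same cluster, and therefore for sharing a single model in the repository.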

Theorems & Definitions (1)

  • Definition 1: Entity Resolution Model Search