Graph-based Active Learning for Entity Cluster Repair

Victor Christen, Daniel Obraczka, Marvin Hofer, Martin Franke, Erhard Rahm

TL;DR

This work tackles cluster repair in multi-source entity resolution, where duplicates within a source violate the common duplicate-free assumption. It introduces GraphCR, a graph-metric-driven framework that trains an edge classifier on features derived from similarity graphs and network structure, coupled with cluster-aware active learning to generate training data efficiently. The method iteratively repairs clusters by removing non-matching edges and merging records under a support-driven scheme. It achieves higher F1-scores on MusicBrainz and Dexter than existing repair methods, and the cluster-aware active learning strategy helps most on datasets that contain duplicates. The approach demonstrates robustness to noisy similarities and supports reliable knowledge graph construction from heterogeneous, real-world data sources.

Abstract

Cluster repair methods aim to determine errors in clusters and modify them so that each cluster consists of records representing the same entity. Current cluster repair methodologies primarily assume duplicate-free data sources, where each record from one source corresponds to a unique record from another. However, real-world data often deviates from this assumption due to quality issues. Recent approaches apply clustering methods in combination with link categorization methods so they can be applied to data sources with duplicates. Nevertheless, the results do not show a clear picture since the quality highly varies depending on the configuration and dataset. In this study, we introduce a novel approach for cluster repair that utilizes graph metrics derived from the underlying similarity graphs. These metrics are pivotal in constructing a classification model to distinguish between correct and incorrect edges. To address the challenge of limited training data, we integrate an active learning mechanism tailored to cluster-specific attributes. The evaluation shows that the method outperforms existing cluster repair methods without distinguishing between duplicate-free or dirty data sources. Notably, our modified active learning strategy exhibits enhanced performance when dealing with datasets containing duplicates, showcasing its effectiveness in such scenarios.
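
The core idea in the abstract, classifying edges of a similarity graph by graph metrics and dropping the incorrect ones, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three features (similarity, number of common neighbors, minimum endpoint degree), the rule-based `is_match` stand-in for a trained classifier, and the single-pass `repair` function are hypothetical and are not the feature set or the iterative procedure of GraphCR.

```python
from collections import defaultdict


def edge_features(edges):
    """Compute simple structural features for every edge.

    edges: dict mapping (u, v) -> similarity in [0, 1].
    Returns a dict mapping (u, v) -> (similarity, common_neighbors, min_degree).
    These features are illustrative assumptions, not the paper's feature set.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    features = {}
    for (u, v), sim in edges.items():
        common = len(neighbors[u] & neighbors[v])  # triangles through the edge
        min_deg = min(len(neighbors[u]), len(neighbors[v]))
        features[(u, v)] = (sim, common, min_deg)
    return features


def is_match(feature, sim_threshold=0.5):
    """Hypothetical stand-in for a trained edge classifier.

    Keeps an edge if it is highly similar or is supported by a common neighbor.
    """
    sim, common, _min_deg = feature
    return sim >= sim_threshold or common > 0


def repair(edges):
    """Drop edges classified as non-matches and return the resulting clusters."""
    feats = edge_features(edges)
    kept = [edge for edge in edges if is_match(feats[edge])]

    adjacency = defaultdict(set)
    for u, v in kept:
        adjacency[u].add(v)
        adjacency[v].add(u)

    # Connected components of the remaining graph are the repaired clusters.
    nodes = sorted({node for edge in edges for node in edge})
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# One initial cluster: a dense triangle (a, b, c) and a pair (d, e),
# joined only by a weak bridge edge (c, d).
example_edges = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.8,
    ("b", "c"): 0.85,
    ("c", "d"): 0.3,
    ("d", "e"): 0.9,
}
clusters = repair(example_edges)
print(clusters)
```

In this toy cluster the bridge edge `("c", "d")` has low similarity and no common neighbor, so the stand-in rule removes it and the cluster splits into two. In a learned setting, `is_match` would be replaced by a classifier trained on labeled edges, which is where the active learning step described in the abstract comes in.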

Paper Structure (15 sections, 1 equation, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Outline of the complete entity resolution process including the repair method identifying incorrect edges $\mathbf{E_{NM}}$ to construct repaired clusters $\mathbf{C_{rep}}$.
  • Figure 2: Overview of the graph-based cluster repair method.
  • Figure 3: Example of the iterative cluster repair procedure showing 6 records of an initial cluster. The dashed red line is an edge being classified as non-match.
  • Figure 4: Results on the MusicBrainz and Dexter (C0, C50, C100) datasets with different duplicate ratios, comparing the basic selection strategy (bootstrap) with the cluster-specific selection (bootstrap ext) in the active learning step.
  • Figure 5: F1-scores of the proposed approach (GraphCR) compared with other repair methods: CLIP [saaedi2018famer], affinity propagation clustering (MSCD-AP) [LermSR2021], and agglomerative hierarchical clustering [SaeediDR2021] with the single (MSCD S-LINK), complete (MSCD C-LINK), and average (MSCD A-LINK) variants of the merging step.
  • ...and 2 more figures