GraLMatch: Matching Groups of Entities with Graphs and Language Models

Fernando De Meer Pardo, Claude Lehmann, Dennis Gehrig, Andrea Nagy, Stefano Nicoli, Branka Hadji Misheva, Martin Braschler, Kurt Stockinger

TL;DR

The paper proposes GraLMatch, a method that partially detects and removes false positive pairwise predictions using graph-based properties. It also shows that fine-tuning a Transformer-based model on a reduced number of labeled samples yields a better final entity group matching than training on more samples and/or incorporating fine-tuning optimizations.

Abstract

In this paper, we present an end-to-end multi-source Entity Matching problem, which we call entity group matching, where the goal is to assign to the same group all records that originate from multiple data sources but represent the same real-world entity. We focus on the effects of transitively matched records, i.e., the records connected by paths in the graph G = (V,E), whose nodes represent the records and whose edges represent the pairwise matches between them. We present a real-world instance of this problem, where the challenge is to match records of companies and financial securities originating from different data providers. We also introduce two new multi-source benchmark datasets that present matching challenges similar to those of real-world records. A distinctive characteristic of these records is that they are regularly updated following real-world events, but updates are not applied uniformly across data sources. This phenomenon makes the matching of certain groups of records possible only through the use of transitive information. In our experiments, we illustrate why considering transitively matched records is challenging: a limited number of false positive pairwise match predictions can throw off the group assignment of large quantities of records. Thus, we propose GraLMatch, a method that can partially detect and remove false positive pairwise predictions through graph-based properties. Finally, we showcase how fine-tuning a Transformer-based model (DistilBERT) on a reduced number of labeled samples yields a better final entity group matching than training on more samples and/or incorporating fine-tuning optimizations, illustrating how precision becomes the deciding factor in the entity group matching of large volumes of records.
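
The notion of transitively matched records can be made concrete with a minimal sketch: if pairwise matches are the edges of G = (V,E), the entity groups are its connected components. The record ids, the `group_records` helper and the union-find approach below are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict


def group_records(records, matches):
    """Group records into the connected components of G = (V, E).

    `records` are the nodes V, `matches` are the pairwise match edges E.
    Hypothetical helper for illustration only.
    """
    parent = {r: r for r in records}

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for r in records:
        groups[find(r)].add(r)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical records from three sources (a, b, c) and two entities (1, 2).
records = ["a1", "b1", "c1", "a2", "b2"]
matches = [("a1", "b1"), ("b1", "c1"), ("a2", "b2")]
groups = group_records(records, matches)
print(groups)

# A single false positive edge is enough to merge both entities into one group.
merged = group_records(records, matches + [("c1", "a2")])
print(merged)
```

In this sketch "a1" and "c1" end up in the same group although they were never matched directly, and the one extra edge in the second call collapses the two groups into one, which is the sensitivity to false positives that the abstract describes.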

Paper Structure (29 sections, 4 figures, 4 tables, 1 algorithm)

Figures (4)

  • Figure 1: Illustration of the workflow of our entity group matching methodology.
  • Figure 2: An example dataset of companies (top part) and securities records (bottom part) to match across multiple data sources. Records #12, #22, #31 and #40 correspond to the same entity, "Crowdstrike". Record #12 can be matched to #31 because they have securities with a matching ISIN, US31807756E, highlighted in orange. The same holds for #22 and #40, which share US318077DSIE, highlighted in violet. Matching the entire group, however, requires recognizing all of the naming variations as equivalent (Crowdstrike Plt., Crowd Strike Platforms, Crowdstrike Holdings, etc.). This task is not trivial, since false positive predictions are likely with, for example, records #13, #23 and #32, which correspond to the entity "Crowdstreet", because of the long character sequences shared across records.
  • Figure 3: Example of transitive matches between records of Figure 2. On the left side, the pairwise matches (#11 and #21), (#21 and #33) and (#33 and #41) imply the transitive matches on the right side, colored in green: (#11 and #33), (#11 and #41) and (#21 and #41).
  • Figure 4: Illustration of entity group matching based on a subset of the records shown in Figure 2. (1) Pairwise predictions: The false positive pairwise match between record #40 (Crowdstrike) and record #13 (Crowdstreet) is illustrated as a dotted orange line. (2) Pre Graph Cleanup: False transitive matches are shown as dotted red lines, e.g. record #12 (Crowdstrike) is wrongly matched transitively with record #13 (Crowdstreet). True positive pairwise and final matches are black lines. (3) Post Graph Cleanup: The false pairwise match, originally shown in orange, is eliminated via the GraLMatch Graph Cleanup. The result is two group matches, as opposed to the single group match produced by the wrong pairwise match.
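
The situation in Figure 4 can be reproduced with a small sketch. The captions do not specify which graph-based properties the GraLMatch Graph Cleanup uses, so the code below substitutes one simple, assumed heuristic: flag edges that are bridges, i.e. edges whose removal disconnects a group. The edges (12, 31) and (22, 40) come from the ISIN matches in Figure 2 and (40, 13) is the false positive from Figure 4; all other edges, and the helpers `components` and `bridges`, are hypothetical.

```python
from collections import defaultdict


def components(nodes, edges):
    """Return the connected components of the match graph as sorted lists."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:  # iterative depth-first search
            node = stack.pop()
            comp.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        comps.append(sorted(comp))
    return sorted(comps)


def bridges(nodes, edges):
    """Edges whose removal increases the number of components (brute force)."""
    base = len(components(nodes, edges))
    return [
        e for e in edges
        if len(components(nodes, [f for f in edges if f != e])) > base
    ]


nodes = [12, 22, 31, 40, 13, 23, 32]
edges = [
    (12, 31), (22, 40),            # ISIN matches described in Figure 2
    (12, 22), (31, 40),            # assumed name-based matches (Crowdstrike)
    (13, 23), (23, 32), (13, 32),  # assumed name-based matches (Crowdstreet)
    (40, 13),                      # false positive from Figure 4
]

before = components(nodes, edges)   # one merged group
suspect = bridges(nodes, edges)     # the single edge linking the two groups
after = components(nodes, [e for e in edges if e not in suspect])
print(before, suspect, after)
```

With these assumed edges, the only bridge is (40, 13), and removing it yields the two groups shown in the Post Graph Cleanup panel. The heuristic is deliberately naive: in a sparsely connected true group (for example a chain of matches) every edge is a bridge, so a practical cleanup would need additional criteria before removing anything.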