Heterogeneity in Entity Matching: A Survey and Experimental Analysis

Mohammad Hossein Moslemi, Amir Mousavi, Behshid Behkamal, Mostafa Milani

TL;DR

This paper addresses heterogeneity in entity matching (HEM) by formalizing a taxonomy that separates representation and semantic heterogeneity and linking these challenges to FAIR data principles. It provides a comprehensive survey of EM methods across schema-aware, representation-learning, graph-based, knowledge-graph, and LLM-based approaches, organizing them according to the HEM taxonomy. The authors also perform targeted experiments to assess robustness and generalization of state-of-the-art models under controlled semantic heterogeneity, revealing persistent limitations. They identify promising directions, including multimodal matching, human-in-the-loop workflows, deeper LLM/knowledge-graph integration, and fairness-aware evaluation in heterogeneous settings. Together, these contributions offer a framework for designing more robust, generalizable, and interoperable EM systems in real-world, heterogeneous data environments.

Abstract

Entity matching (EM) is a fundamental task in data integration and analytics, essential for identifying records that refer to the same real-world entity across diverse sources. In practice, datasets often differ widely in structure, format, schema, and semantics, creating substantial challenges for EM. We refer to this setting as Heterogeneous EM (HEM). This survey offers a unified perspective on HEM by introducing a taxonomy, grounded in prior work, that distinguishes two primary categories -- representation and semantic heterogeneity -- and their subtypes. The taxonomy provides a systematic lens for understanding how variations in data form and meaning shape the complexity of matching tasks. We then connect this framework to the FAIR principles -- Findability, Accessibility, Interoperability, and Reusability -- demonstrating how they both reveal the challenges of HEM and suggest strategies for mitigating them. Building on this foundation, we critically review recent EM methods, examining their ability to address different heterogeneity types, and conduct targeted experiments on state-of-the-art models to evaluate their robustness and adaptability under semantic heterogeneity. Our analysis uncovers persistent limitations in current approaches and points to promising directions for future research, including multimodal matching, human-in-the-loop workflows, deeper integration with large language models and knowledge graphs, and fairness-aware evaluation in heterogeneous settings.

Paper Structure

This paper contains 34 sections, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Taxonomy of heterogeneity in entity matching (HEM), including representation- and semantic-level variation.
  • Figure 2: Impact of synonym injection in test data across EM methods and datasets. Second-row figures provide detailed views of high-performing methods from the first-row figures.
  • Figure 3: Random word vs. synonym replacement in Abt-Buy.
  • Figure 4: Random word vs. synonym replacement in WDC.
  • Figure 5: Performance vs. hierarchical data distortion (information loss) when changing test data.
  • ...and 4 more figures