Interactive Ontology Matching with Cost-Efficient Learning

Bin Cheng, Jonathan Fürst, Tobias Jacobs, Celia Garrido-Hidalgo

TL;DR

The paper tackles the last-mile challenge in ontology matching where fixed heuristics miss many true matches and traditional interactive systems struggle to scale. It introduces DualLoop, a cost-efficient active-learning framework that combines a weak-supervision ensemble, a fast exploitation loop, and a slow exploration loop to discover additional matches beyond initial heuristics. Empirical results across three diverse datasets show DualLoop yields higher F1 and recall while substantially reducing the human-query burden, outperforming both learning-based baselines and heuristic-only approaches. The method is deployed in TrioNet for industrial interlinking in the AEC sector, demonstrating strong practical value with notable reductions in verification efforts and improved match coverage.

Abstract

The creation of high-quality ontologies is crucial for data integration and knowledge-based reasoning, specifically in the context of the rising data economy. However, automatic ontology matchers are often bound to the heuristics they are based on, leaving many matches unidentified. Interactive ontology matching systems involving human experts have been introduced, but they do not solve the fundamental issue of flexibly finding additional matches outside the scope of the implemented heuristics, even though this is highly demanded in industrial settings. Active machine learning methods appear to be a promising path towards a flexible interactive ontology matcher. However, off-the-shelf active learning mechanisms suffer from low query efficiency due to extreme class imbalance, resulting in a last-mile problem where high human effort is required to identify the remaining matches. To address the last-mile problem, this work introduces DualLoop, an active learning method tailored to ontology matching. DualLoop offers three main contributions: (1) an ensemble of tunable heuristic matchers, (2) a short-term learner with a novel query strategy adapted to highly imbalanced data, and (3) long-term learners to explore potential matches by creating and tuning new heuristics. We evaluated DualLoop on three datasets of varying sizes and domains. Compared to existing active learning methods, we consistently achieved better F1 scores and recall, reducing the expected query cost spent on finding 90% of all matches by over 50%. Compared to traditional interactive ontology matchers, we are able to find additional, last-mile matches. Finally, we detail the successful deployment of our approach within an actual product and report its operational performance results within the Architecture, Engineering, and Construction (AEC) industry sector, showcasing its practical value and efficiency.
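
To make the interplay of the three contributions more concrete, the following is a minimal Python sketch, not the authors' implementation. It makes several assumptions that are not in the paper: the two similarity functions, the `TunableHeuristic` class, the greedy "most likely match first" query rule (a simple stand-in for the paper's query strategy for imbalanced data), and threshold re-tuning as a stand-in for creating and tuning heuristics. The two loops also run one after the other here rather than in parallel.

```python
from difflib import SequenceMatcher
from itertools import product


def name_similarity(a, b):
    """Character-level similarity of two class names (hypothetical heuristic)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_overlap(a, b):
    """Jaccard overlap of underscore-separated tokens (hypothetical heuristic)."""
    ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
    return len(ta & tb) / len(ta | tb)


class TunableHeuristic:
    """A heuristic matcher whose decision threshold can be re-tuned."""

    def __init__(self, score_fn, threshold=0.8):
        self.score_fn = score_fn
        self.threshold = threshold

    def vote(self, pair):
        return self.score_fn(*pair) >= self.threshold

    def retune(self, labeled):
        # Pick the grid threshold that agrees with the most expert labels.
        grid = [i / 10 for i in range(1, 10)]

        def agreement(t):
            return sum((self.score_fn(*p) >= t) == y for p, y in labeled.items())

        self.threshold = max(grid, key=agreement)


def dual_loop(source, target, oracle, budget, batch_size=3):
    heuristics = [TunableHeuristic(name_similarity), TunableHeuristic(token_overlap)]
    unlabeled = set(product(source, target))
    labeled = {}

    def priority(pair):
        # Rank by number of heuristic votes, then by mean raw score.
        votes = sum(h.vote(pair) for h in heuristics)
        mean = sum(h.score_fn(*pair) for h in heuristics) / len(heuristics)
        return (votes, mean)

    for step in range(1, budget + 1):
        if not unlabeled:
            break
        # Fast loop: query the expert on the most promising unlabeled pair.
        pair = max(sorted(unlabeled), key=priority)
        unlabeled.remove(pair)
        labeled[pair] = oracle(pair)
        # Slow loop: after each annotation batch, re-tune the heuristics.
        if step % batch_size == 0:
            for h in heuristics:
                h.retune(labeled)

    confirmed = {p for p, is_match in labeled.items() if is_match}
    # Remaining pairs are predicted as matches only on a strict majority vote.
    predicted = {p for p in unlabeled
                 if 2 * sum(h.vote(p) for h in heuristics) > len(heuristics)}
    return confirmed, predicted


# Toy example: the class names and gold alignment are invented for illustration.
source = ["conference_paper", "author", "review"]
target = ["paper", "paper_author", "reviewer", "venue"]
gold = {("conference_paper", "paper"), ("author", "paper_author"), ("review", "reviewer")}

confirmed, predicted = dual_loop(source, target, lambda p: p in gold, budget=5)
print("confirmed by expert:", sorted(confirmed))
print("predicted by ensemble:", sorted(predicted))
```

Querying likely matches first is only one possible way to cope with class imbalance: most candidate pairs are non-matches, so a strategy that samples at random spends most of its budget confirming negatives.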

Paper Structure

This paper contains 31 sections, 4 equations, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: Aligning different source ontologies with a backbone ontology to facilitate interoperability. Domain experts are available to support an automated system.
  • Figure 2: Problems with existing interactive matchers and active learning on the Conference benchmark dataset OAEI-Conference: interactive ontology matchers allow for human-in-the-loop user annotations, but their performance is limited, and users have no option to provide further annotations to improve performance beyond a certain point; active learning exhibits an extremely low sample efficiency, impractical for real applications.
  • Figure 3: Overview of the DualLoop System. DualLoop encompasses two learning loops that run in parallel: (1) the fast loop picks class pairs, queries the domain expert, and updates the prediction results for the remaining unlabeled class pairs; (2) the slow loop creates and updates tunable labeling functions based on new annotation batches.
  • Figure 4: F1 score plotted against the percentage of overall query budget for different datasets. DualLoop clearly achieves higher F1 scores than the other two methods on all datasets by improving the matching quality in the face of small query budgets.
  • Figure 5: F1 score achieved on the conference dataset OAEI-Conference by our approach (DualLoop), as compared to existing active learning methods without (AL-RF) and with weak supervision (WeSAL), plus state-of-the-art interactive ontology matchers (AML and LogMap). DualLoop already outperforms all other methods with fewer than 30 user annotations.
  • ...and 4 more figures