Complex Ontology Matching with Large Language Model Embeddings

Guilherme Sousa, Rinaldo Lima, Cassia Trojahn

TL;DR

This work addresses the expressiveness gap in ontology and knowledge graph matching by integrating large language model embeddings into a CANARD-based framework guided by SPARQL queries. It introduces four embedding-based modifications (Label embedding similarity, Embeddings of SPARQL query, Subgraph embeddings, and Instance embeddings) that change how surrounding subgraphs are matched, using pre-trained models with no additional training. In experiments on the populated OAEI Conference benchmark, the approach achieves higher precision and F-measure than the baseline and several state-of-the-art systems, and the results also show the impact of each modification. Because the method relies only on user-provided alignment needs expressed as SPARQL queries and on pre-trained embeddings, it is broadly applicable and scalable for complex matching tasks. Directions for future work include pure T-Box strategies and ontology partitioning.

Abstract

Ontology Matching, and more broadly Knowledge Graph Matching, is a challenging task in which expressiveness has not been fully addressed. Despite the increasing use of embeddings and language models for this task, approaches for generating expressive correspondences still do not take full advantage of these models, in particular large language models (LLMs). This paper proposes to integrate LLMs into an approach for generating expressive correspondences based on alignment needs and ABox-based relation discovery. Correspondences are generated by matching similar surroundings of instance sub-graphs. The integration of LLMs results in different architectural modifications, including label similarity, sub-graph matching, and entity matching. The performance of word embeddings, sentence embeddings, and LLM-based embeddings was compared. The results demonstrate that integrating LLMs surpasses all other models, enhancing the baseline version of the approach with a 45% increase in F-measure.

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Original architecture and steps where embeddings have been used. In the figure, LES refers to Label embedding similarity, ESQ to Embeddings of SPARQL query, SE to Subgraph embeddings, and IE to Instance embeddings.
  • Figure 2: Process of generating embeddings from a given label. Each label is tokenized, then the tokens are fed into the language model to generate the embeddings for each token. The last step is aggregating all token embeddings to generate a label embedding.
  • Figure 3: Example of the first three modifications in similarity computation. Label embedding similarity is the first modification, where the Levenshtein similarity is replaced by embedding similarity, resulting in an n:m comparison. In Embeddings of SPARQL query, the SPARQL query embeddings are first aggregated and then compared with the embeddings of subgraph labels, resulting in a 1:m comparison. Subgraph embeddings are created by aggregating the embeddings of labels, resulting in a 1:1 comparison.
  • Figure 4: Example of a subject-type triple embedding, where the predicate embedding P and object embedding O form the final embedding. For predicate-type triples, the embeddings S and O are combined, and for object-type triples, the embeddings P and O are combined. In binary queries the subgraphs are paths. The embeddings of the nodes and predicates are aggregated independently, and then the resulting embeddings are aggregated to produce the final embedding representing the path.
  • Figure 5: Performance of the models when used with each architecture setting.
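
The label-embedding process described for Figure 2 and the three comparison modes described for Figure 3 can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: `toy_token_embedding` is a hypothetical stand-in for a pre-trained language model, whitespace splitting stands in for a real tokenizer, mean pooling is assumed as the aggregation step, cosine similarity is assumed as the embedding similarity, and the function names and example labels are invented.

```python
import hashlib

import numpy as np

DIM = 16  # illustrative embedding size, not taken from the paper


def toy_token_embedding(token: str) -> np.ndarray:
    """Hypothetical stand-in for a language model: one fixed vector per token."""
    digest = hashlib.sha256(token.lower().encode("utf-8")).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:4], "big"))
    return rng.standard_normal(DIM)


def label_embedding(label: str) -> np.ndarray:
    """Figure 2 idea: tokenize, embed each token, aggregate (mean is assumed)."""
    tokens = label.split()  # whitespace split stands in for a real tokenizer
    if not tokens:
        raise ValueError("label must contain at least one token")
    return np.mean([toy_token_embedding(t) for t in tokens], axis=0)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def les_similarity(query_labels, subgraph_labels) -> np.ndarray:
    """n:m comparison: every query label against every subgraph label."""
    return np.array([
        [cosine(label_embedding(q), label_embedding(s)) for s in subgraph_labels]
        for q in query_labels
    ])


def esq_similarity(query_labels, subgraph_labels) -> np.ndarray:
    """1:m comparison: aggregate the query first, then compare per label."""
    query = np.mean([label_embedding(q) for q in query_labels], axis=0)
    return np.array([cosine(query, label_embedding(s)) for s in subgraph_labels])


def se_similarity(query_labels, subgraph_labels) -> float:
    """1:1 comparison: aggregate both sides, then compare once."""
    query = np.mean([label_embedding(q) for q in query_labels], axis=0)
    subgraph = np.mean([label_embedding(s) for s in subgraph_labels], axis=0)
    return cosine(query, subgraph)


if __name__ == "__main__":
    query = ["accepted paper", "author"]       # invented query labels
    subgraph = ["paper", "writer", "review"]   # invented subgraph labels
    print(les_similarity(query, subgraph).shape)  # one score per label pair
    print(esq_similarity(query, subgraph).shape)  # one score per subgraph label
    print(se_similarity(query, subgraph))         # a single score
```

The three functions differ only in where aggregation happens: not at all (n:m), on the query side only (1:m), or on both sides (1:1). Later aggregation means fewer similarity computations but a coarser comparison.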