Empowering Small-Scale Knowledge Graphs: A Strategy of Leveraging General-Purpose Knowledge Graphs for Enriched Embeddings

Albert Sawczyn, Jakub Binkowski, Piotr Bielak, Tomasz Kajdanowicz

TL;DR

Knowledge-intensive tasks strain ML systems, and LLMs often hallucinate on them. The paper proposes a modular framework that enriches small domain-specific KGs (DKGs) by aligning and linking them to a large general-purpose KG (GKG). It computes entity representations from labels and neighborhood context, connects DKG entities to their $k$ nearest neighbors in the GKG to form a linked KG, and trains KG completion with a weighted loss that accounts for imperfect alignments via $w_s = 1 / (1 + \mathrm{distance}(x(e_i), x(e_j)))$. The approach yields up to a $44.9\%$ Hits@10 improvement in synthetic, data-scarce settings and meaningful gains in real-world scenarios, depending on GKG suitability, demonstrating that small KGs can leverage broad knowledge to improve robustness and reduce hallucinations. The framework is modular and reproducible, enabling broader adoption of KGs in knowledge-intensive tasks and offering a practical pathway for enhancing downstream ML systems without excessive KG-building costs.
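The linking and weighting step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `link_knn`, the use of NumPy, and the choice of Euclidean distance are assumptions made for illustration (the summary only says "distance"), and the entity representations $x(e)$ are taken as precomputed vectors.

```python
import numpy as np

def link_knn(dkg_emb: np.ndarray, gkg_emb: np.ndarray, k: int):
    """Link each DKG entity to its k nearest GKG entities (hypothetical helper).

    dkg_emb: (n_dkg, d) array of DKG entity representations x(e_i).
    gkg_emb: (n_gkg, d) array of GKG entity representations x(e_j).
    Returns (nn_idx, weights), both of shape (n_dkg, k).
    """
    # Pairwise Euclidean distances (assumed metric), shape (n_dkg, n_gkg).
    diff = dkg_emb[:, None, :] - gkg_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Indices of the k closest GKG entities for every DKG entity.
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dist, nn_idx, axis=1)

    # Link weight w_s = 1 / (1 + distance): closer pairs get weights near 1,
    # distant (less reliable) alignments are down-weighted in the loss.
    weights = 1.0 / (1.0 + nn_dist)
    return nn_idx, weights


# Toy usage: one DKG entity, three GKG entities, k = 2.
dkg = np.array([[0.0, 0.0]])
gkg = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])
idx, w = link_knn(dkg, gkg, k=2)
```

In this toy case the DKG entity is linked to the first two GKG entities; an exact match receives weight 1, and weights shrink as the distance grows. The brute-force distance matrix is only practical for small inputs; a large GKG would likely need an approximate nearest-neighbor index instead.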

Abstract

Knowledge-intensive tasks pose a significant challenge for Machine Learning (ML) techniques. Commonly adopted methods, such as Large Language Models (LLMs), often exhibit limitations when applied to such tasks. Nevertheless, there have been notable endeavours to mitigate these challenges, with a significant emphasis on augmenting LLMs through Knowledge Graphs (KGs). While KGs provide many advantages for representing knowledge, their development costs can deter extensive research and applications. Addressing this limitation, we introduce a framework for enriching embeddings of small-scale domain-specific Knowledge Graphs with well-established general-purpose KGs. Adopting our method, a modest domain-specific KG can benefit from a performance boost in downstream tasks when linked to a substantial general-purpose KG. Experimental evaluations demonstrate a notable enhancement, with up to a 44% increase observed in the Hits@10 metric. This relatively unexplored research direction can catalyze more frequent incorporation of KGs in knowledge-intensive tasks, resulting in more robust, reliable ML implementations that hallucinate less than prevalent LLM solutions.

Keywords: knowledge graph, knowledge graph completion, entity alignment, representation learning, machine learning

Paper Structure

This paper contains 30 sections, 4 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Diagram of two aligned and linked Knowledge Graphs: domain-specific (top) and general-purpose (bottom). Artificial links, marked with dashed lines, connect the two KGs.
  • Figure 2: Overview diagram of the proposed framework's pipeline.
  • Figure 3: Synthetic scenario: performance boost across different sampling strategies and varying rates $p$. The boost is the performance improvement the framework achieves over a single graph. Note that the relation sampling datasets should not be compared directly with one another, because the test set was not preserved, which causes high standard deviation (see the section on sampling).
  • Figure 4: WN18RR: Performance across different sampling strategies and varying rates $p$. Two settings are shown: training on the single graph and on the linked graph. The green line shows the performance on the original graph ($p=1.0$). The shaded areas represent the standard deviation across multiple runs. Note that the relation sampling datasets should not be compared directly with one another, because the test set was not preserved, which causes high standard deviation (see the section on sampling).
  • Figure 5: FB15k-237: Performance across different sampling strategies and varying rates $p$. Two settings are shown: training on the single graph and on the linked graph. The green line shows the performance on the original graph ($p=1.0$). The shaded areas represent the standard deviation across multiple runs. Note that the relation sampling datasets should not be compared directly with one another, because the test set was not preserved, which causes high standard deviation (see the section on sampling).
  • ...and 2 more figures