
KG-FIT: Knowledge Graph Fine-Tuning Upon Open-World Knowledge

Pengcheng Jiang, Lang Cao, Cao Xiao, Parminder Bhatia, Jimeng Sun, Jiawei Han

TL;DR

KG-FIT incorporates open-world knowledge from LLMs to significantly enhance the expressiveness and informativeness of KG embeddings.

Abstract

Knowledge Graph Embedding (KGE) techniques are crucial in learning compact representations of entities and relations within a knowledge graph, facilitating efficient reasoning and knowledge discovery. While existing methods typically focus either on training KGE models solely based on graph structure or fine-tuning pre-trained language models with classification data in KG, KG-FIT leverages LLM-guided refinement to construct a semantically coherent hierarchical structure of entity clusters. By incorporating this hierarchical knowledge along with textual information during the fine-tuning process, KG-FIT effectively captures both global semantics from the LLM and local semantics from the KG. Extensive experiments on the benchmark datasets FB15K-237, YAGO3-10, and PrimeKG demonstrate the superiority of KG-FIT over state-of-the-art pre-trained language model-based methods, achieving improvements of 14.4%, 13.5%, and 11.9% in the Hits@10 metric for the link prediction task, respectively. Furthermore, KG-FIT yields substantial performance gains of 12.6%, 6.7%, and 17.7% compared to the structure-based base models upon which it is built. These results highlight the effectiveness of KG-FIT in incorporating open-world knowledge from LLMs to significantly enhance the expressiveness and informativeness of KG embeddings.

Paper Structure

This paper contains 35 sections, 28 equations, 9 figures, 14 tables, 3 algorithms.

Figures (9)

  • Figure 2: Overview of KG-FIT. Input and Output are highlighted at each step. Step 1: Obtain text embeddings for all entities in the KG, achieved by merging word embeddings with description embeddings retrieved from LLMs. Step 2: Hierarchical clustering is applied iteratively to all entity embeddings over various distance thresholds, monitored by a Silhouette scorer to identify optimal clusters, thus constructing a seed hierarchy where each leaf node represents a cluster of semantically similar entities. Step 3: Leveraging LLM guidance, the seed hierarchy is iteratively refined bottom-up through a series of suggested actions, aiming for a more accurate organization of KG entities with LLM's knowledge. Step 4: Use the refined hierarchy along with KG triples and the initial entity embeddings to fine-tune the embeddings under a series of distance constraints.
  • Figure 3: KG-FIT can mitigate overfitting (upper) and underfitting (lower) of structure-based models.
  • Figure 4: KG-FIT on FB15K-237 with different hierarchy types. None indicates no hierarchical information input. Seed denotes the seed hierarchy. G3.5/G4 denotes the LHR hierarchy constructed by GPT-3.5/4o. LHR hierarchies outperform the seed hierarchy, with more advanced LLMs constructing higher-quality hierarchies.
  • Figure 5: KG-FIT on FB15K-237 with different text embeddings. BT, RBT, ada2, and te3 denote BERT, RoBERTa, text-embedding-ada-002, and text-embedding-3-large, respectively. The seed hierarchy is used for all settings. Pre-trained text embeddings from LLMs are substantially better than those from small PLMs.
  • Figure 6: Visualization of Entity Embedding (left to right: initial text embedding, HAKE embedding, and $\text{KG-FIT}_{\text{HAKE}}$ embedding). Upper (local): Embeddings (dim=2048) of <Maraviroc, drug_effect, CAA (Coronary artery atherosclerosis)> and <Cladribine, drug_effect, Exertional dyspnea>, two parent-child triples selected from PrimeKG, in the polar coordinate system. Here, the normalized entity embedding $\bar{\mathbf{e}}$ is split into $\mathbf{e_1} = \bar{\mathbf{e}}[:\frac{n}{2}]$ and $\mathbf{e_2} = \bar{\mathbf{e}}[\frac{n}{2}+1:]$, where $n$ is the hidden dimension; these serve as the values on the x-axis and y-axis, respectively, consistent with the visualization strategy of Zhang et al. (zhang2020learning). Lower (global): t-SNE plots of different embeddings of sampled entities, with colors indicating clusters (e.g., Maraviroc belongs to the HIV Drugs cluster). Triangles indicate the positions of $\blacktriangle$Maraviroc, $\blacktriangle$CAA, $\blacktriangle$Cladribine, and $\blacktriangle$Exertional dyspnea. Observations: While the initial text embeddings capture global semantics, they fail to delineate local parent-child relationships within the KG, as seen in the intermingled polar plots. In contrast, HAKE shows more distinct grouping by modulus on the polar plots, capturing hierarchical local semantics, but fails to adequately capture global semantics. Notably, $\texttt{KG-FIT}$, which incorporates prior information from LLMs and is fine-tuned on the KG, maintains global semantics from the pre-trained text embeddings while better capturing local KG semantics, demonstrating its superior representational power across local and global scales.
  • ...and 4 more figures
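
Step 2 in the Figure 2 caption (hierarchical clustering swept over several distance thresholds, with a Silhouette score picking the best cut) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pick_seed_clusters`, the average-linkage and cosine-distance choices, the threshold grid, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score


def pick_seed_clusters(embeddings, thresholds):
    """Hypothetical helper: return (threshold, labels, score) with the best silhouette."""
    # Build the agglomerative tree once; each threshold only cuts it differently.
    # Average linkage and cosine distance are assumptions, not taken from the paper.
    tree = linkage(embeddings, method="average", metric="cosine")
    best = None
    for t in thresholds:
        labels = fcluster(tree, t=t, criterion="distance")
        n_clusters = len(set(labels))
        # The silhouette score is only defined for 2 <= n_clusters <= n_samples - 1.
        if n_clusters < 2 or n_clusters >= len(embeddings):
            continue
        score = silhouette_score(embeddings, labels, metric="cosine")
        if best is None or score > best[2]:
            best = (t, labels, score)
    return best


# Toy stand-in for entity text embeddings: three well-separated groups of 10 points.
rng = np.random.default_rng(0)
centers = np.eye(3, 8) * 5.0
toy = np.vstack([c + 0.1 * rng.standard_normal((10, 8)) for c in centers])

threshold, labels, score = pick_seed_clusters(toy, np.linspace(0.05, 0.95, 19))
```

On real data, each resulting cluster would correspond to a leaf node of the seed hierarchy; building the upper levels of the hierarchy and the LLM-guided refinement of Step 3 are outside the scope of this sketch.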