Knowledge Homophily in Large Language Models

Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt, Yu Zhang, Yao Ma, Yu Wang

TL;DR

A Graph Neural Network (GNN) regression model is proposed to estimate entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted scores not only improve the efficiency of active labeling for fine-tuning to inject knowledge into LLMs but also enhance multi-hop path retrieval in reasoning-intensive question answering.

Abstract

Large Language Models (LLMs) have been increasingly studied as neural knowledge bases for supporting knowledge-intensive applications such as question answering and fact checking. However, the structural organization of their knowledge remains unexplored. Inspired by cognitive neuroscience findings, such as semantic clustering and priming, where knowing one fact increases the likelihood of recalling related facts, we investigate an analogous knowledge homophily pattern in LLMs. To this end, we map LLM knowledge into a graph representation through knowledge checking at both the triplet and entity levels. After that, we analyze the knowledgeability relationship between an entity and its neighbors, discovering that LLMs tend to possess a similar level of knowledge about entities positioned closer in the graph. Motivated by this homophily principle, we propose a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted knowledgeability enables us to prioritize checking less well-known triplets, thereby maximizing knowledge coverage under the same labeling budget. This not only improves the efficiency of active labeling for fine-tuning to inject knowledge into LLMs but also enhances multi-hop path retrieval in reasoning-intensive question answering.
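The entity-level aggregation and the neighbor comparison described in the abstract can be illustrated with a small sketch. The summary does not give the exact formulas, so the definitions here are assumptions: an entity's knowledgeability is taken as the fraction of checked triplets involving it (as head or tail) that the LLM answered correctly, and a node's homophily as one minus the mean absolute score difference to its graph neighbors. The function names and the toy triplets are hypothetical.

```python
from collections import defaultdict

def entity_knowledgeability(checked_triplets):
    """Average triplet-level check results (0 or 1) per entity.

    checked_triplets: iterable of (head, relation, tail, known).
    Assumption: both head and tail count toward the entity score.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for head, _rel, tail, known in checked_triplets:
        for entity in (head, tail):
            totals[entity] += known
            counts[entity] += 1
    return {e: totals[e] / counts[e] for e in counts}

def node_homophily(scores, checked_triplets):
    """Assumed homophily: 1 - mean |K(v) - K(u)| over neighbors u of v."""
    neighbors = defaultdict(set)
    for head, _rel, tail, _known in checked_triplets:
        if head != tail:
            neighbors[head].add(tail)
            neighbors[tail].add(head)
    return {
        v: 1.0 - sum(abs(scores[v] - scores[u]) for u in nbrs) / len(nbrs)
        for v, nbrs in neighbors.items()
    }

# Toy data: 1 means the LLM answered the triplet check correctly.
toy = [
    ("A", "r", "B", 1),
    ("A", "r", "C", 1),
    ("B", "r", "C", 0),
    ("D", "r", "E", 0),
]
scores = entity_knowledgeability(toy)
homophily = node_homophily(scores, toy)
print(scores)
print(homophily)
```

Under these assumed definitions, a homophily value near 1 means an entity's score closely matches its neighbors' scores, which is the pattern the abstract reports.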

Paper Structure

This paper contains 26 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: We check whether the LLM knows individual triplet facts and aggregate the results to obtain entity knowledgeability scores. The visualized entity-level scores reveal knowledge homophily: topologically close entities form distinct high/low-knowledge (red/blue) communities. The graph layout uses ForceAtlas2 [jacomy2014forceatlas2] to preserve topological proximity.
  • Figure 2: (a) Homophily distribution of node knowledgeability; (b) Average knowledge homophily across datasets and LLMs, with the black dashed line showing Citeseer (0.74), a classic high-homophily dataset for node classification [wang2021tree].
  • Figure 3: (a) Neighboring nodes possess more similar knowledgeability scores than randomly sampled nodes; (b) Entities with their distinct knowledgeability levels $\mathcal{K}(v)$ indicated by node color (Red = High, Blue = Low).
  • Figure 4: Homophily-guided Knowledge Injection and Retrieval: The process begins by training a GNN on a subset of entities with ground-truth knowledgeability scores (Blue Nodes) obtained by querying the base LLM. The trained GNN then infers the knowledgeability scores for all remaining entities (Green Nodes). Based on these predictions, triplets associated with entities estimated to have the lowest knowledge values are selected until the budget is met. Finally, the base LLM is fine-tuned on these less-known triplets to efficiently inject new knowledge, and its improved performance is measured on a held-out test set (Orange Nodes) in Figure \ref{fig-knowledgeapplication}(a). The estimated knowledgeability scores also guide retrieval, as illustrated in Figure \ref{fig-knowledgeapplication}(b).
  • Figure 5: The knowledge injection performance of the fine-tuned Mistral models on the CoDEx-S dataset across varying test set sizes. The GNN-guided approach maintains a significant performance advantage over other methods.
  • ...and 1 more figure
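The budgeted selection step described in the Figure 4 caption can be sketched as follows. This is not the paper's GNN: as a deliberately simple stand-in, unlabeled entities receive the mean score of their labeled neighbors, which only mimics the idea of borrowing information from the neighborhood. Scoring a triplet by the mean of its head and tail estimates, the default score of 0.5, and all function names and toy values are assumptions made for illustration.

```python
from collections import defaultdict

def estimate_from_neighbors(known_scores, triplets, default=0.5):
    """Stand-in for the GNN: mean score of labeled neighbors.

    known_scores: {entity: score} for entities checked against the LLM.
    triplets: iterable of (head, relation, tail).
    Entities with no labeled neighbor fall back to `default` (assumed).
    """
    neighbors = defaultdict(set)
    for head, _rel, tail in triplets:
        neighbors[head].add(tail)
        neighbors[tail].add(head)
    estimates = dict(known_scores)
    for node, nbrs in neighbors.items():
        if node in known_scores:
            continue
        labeled = [known_scores[n] for n in nbrs if n in known_scores]
        estimates[node] = sum(labeled) / len(labeled) if labeled else default
    return estimates

def select_triplets(triplets, estimates, budget):
    """Return up to `budget` triplets with the lowest estimated scores.

    Assumption: a triplet's score is the mean of its head and tail estimates.
    """
    def triplet_score(triplet):
        head, _rel, tail = triplet
        return (estimates[head] + estimates[tail]) / 2
    return sorted(triplets, key=triplet_score)[:budget]

# Toy graph: A, B, D were checked against the LLM; C and E are inferred.
triplets = [("A", "r", "B"), ("A", "r", "C"), ("B", "r", "C"), ("D", "r", "E")]
known = {"A": 0.9, "B": 0.8, "D": 0.1}
estimates = estimate_from_neighbors(known, triplets)
selected = select_triplets(triplets, estimates, budget=2)
print(selected)
```

The selected triplets would then serve as the fine-tuning set, so the labeling budget is spent on facts the model is estimated to know least.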