GLaM: Fine-Tuning Large Language Models for Domain Knowledge Graph Alignment via Neighborhood Partitioning and Generative Subgraph Encoding
Stefan Dernbach, Khushbu Agarwal, Alejandro Zuniga, Michael Henry, Sutanay Choudhury
TL;DR
GLaM addresses the gap in grounding large language models in domain-specific knowledge graphs by fine-tuning LLMs on text representations of graph neighborhoods. For each node $v$, it encodes the $k$-hop neighborhood $G_{context}(v,k)$ into context–question–answer pairs using an aggregation function $f_{aggr}$, an encoding function $f_{enc}$, and a question-answer generation function $f_{qa}$, subject to a token budget $T_{max}$ and a neighborhood size limit $N_{max}$. Five encoding strategies are explored. Evaluations on UMLS and DBLP show that graph-aligned fine-tuning improves fact recall and multi-hop reasoning over baselines, with summarization-based encodings and complete adjacency information yielding the strongest gains; smaller models also improve substantially. The approach couples structured symbolic knowledge with neural representations, enabling more reliable, graph-informed reasoning for domain-specific QA, with potential applications to biomedical, academic, and enterprise knowledge graphs.
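The pipeline named above can be illustrated with a minimal sketch. Only the symbols $G_{context}(v,k)$, $f_{aggr}$, $f_{enc}$, $f_{qa}$, $T_{max}$, and $N_{max}$ come from the text; everything else is an assumption made for illustration: the toy triples, the reading of $G_{context}(v,k)$ as a $k$-hop neighborhood, the sentence and question templates, and the use of whitespace-separated words as a stand-in for tokens. GLaM itself uses an LLM to generate and summarize the encodings, which this sketch does not attempt.

```python
from collections import defaultdict

# Toy knowledge graph as (head, relation, tail) triples; the data is made up.
TRIPLES = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "treats", "fever"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "treats", "thrombosis"),
]

def g_context(triples, v, k, n_max):
    """Collect at most n_max triples within k hops of node v (undirected BFS)."""
    frontier, seen, picked = {v}, {v}, []
    for _ in range(k):
        nxt = set()
        for h, r, t in triples:
            if (h in frontier or t in frontier) and (h, r, t) not in picked:
                if len(picked) == n_max:
                    return picked
                picked.append((h, r, t))
                nxt.update({h, t} - seen)
        seen |= nxt
        frontier = nxt
    return picked

def f_aggr(subgraph):
    """Hypothetical aggregation: group tail entities by (head, relation)."""
    grouped = defaultdict(list)
    for h, r, t in subgraph:
        grouped[(h, r)].append(t)
    return dict(grouped)

def f_enc(grouped, t_max):
    """Hypothetical encoding: one sentence per group, kept within t_max words."""
    sentences, used = [], 0
    for (h, r), tails in grouped.items():
        sentence = f"{h} {r.replace('_', ' ')} {' and '.join(tails)}."
        n_words = len(sentence.split())  # crude proxy for a tokenizer count
        if used + n_words > t_max:
            break
        sentences.append(sentence)
        used += n_words
    return " ".join(sentences)

def f_qa(grouped):
    """Hypothetical QA generation: one templated question per group."""
    return [
        (f"List every entity linked to {h} by the relation '{r}'.", ", ".join(tails))
        for (h, r), tails in grouped.items()
    ]

def build_examples(triples, v, k=1, n_max=10, t_max=64):
    """Produce context-question-answer records for node v."""
    grouped = f_aggr(g_context(triples, v, k, n_max))
    context = f_enc(grouped, t_max)
    return [{"context": context, "question": q, "answer": a} for q, a in f_qa(grouped)]

examples = build_examples(TRIPLES, "aspirin")
```

Each record in `examples` pairs the encoded neighborhood with one question and its answer, which is the shape of data a fine-tuning job would consume. Raising `k` widens the neighborhood, while `n_max` and `t_max` cap how much of it reaches the context.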
Abstract
Integrating large language models (LLMs) with knowledge graphs derived from domain-specific data represents an important advancement towards more powerful and factual reasoning. As these models grow more capable, it is crucial to enable them to perform multi-step inferences over real-world knowledge graphs while minimizing hallucination. While large language models excel at conversation and text generation, their ability to reason over domain-specialized graphs of interconnected entities remains limited. For example, can we query an LLM to identify the optimal contact in a professional network for a specific goal, based on relationships and attributes in a private database? The answer is no: such capabilities lie beyond current methods. This question underscores a critical technical gap that must be addressed. Many high-value applications in areas such as science, security, and e-commerce rely on proprietary knowledge graphs encoding unique structures, relationships, and logical constraints. We introduce a fine-tuning framework for developing Graph-aligned LAnguage Models (GLaM) that transforms a knowledge graph into an alternate text representation with labeled question-answer pairs. We demonstrate that grounding the models in specific graph-based knowledge expands the models' capacity for structure-based reasoning. Our methodology leverages the large language model's generative capabilities to create the dataset and offers an efficient alternative to retrieval-augmented generation style methods.
