GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data

Margarita Belova, Jiaxin Xiao, Shikhar Tuli, Niraj K. Jha

TL;DR

GraphMERT addresses the challenge of reliably distilling domain-specific knowledge graphs from limited high-quality data by learning cross-modal semantic and syntactic representations in an encoder-only transformer augmented with graph attention. The framework couples a seed knowledge graph with a small text corpus to train a compact 80M-parameter model that distills structured semantic relations from neural weights into explicit triples. It introduces leafy chain graphs, a joint masked language modeling (MLM) and masked node modeling (MNM) pretraining objective, and an injection mechanism that preserves ontological validity while expanding vocabularies. Experiments in the diabetes domain show that GraphMERT produces triples with higher factuality and validity than an LLM baseline, along with better GraphRAG-based downstream QA performance. The results support a neurosymbolic path toward domain-specific superintelligence, combining scalable neural learning with auditable symbolic knowledge for trustworthy AI in high-stakes domains.

Abstract

Researchers have pursued neurosymbolic artificial intelligence (AI) applications for nearly three decades because symbolic components provide abstraction while neural components provide generalization. Thus, a marriage of the two components can lead to rapid advancements in AI. Yet, the field has not realized this promise since most neurosymbolic AI frameworks fail to scale. In addition, the implicit representations and approximate reasoning of neural approaches limit interpretability and trust. Knowledge graphs (KGs), a gold-standard representation of explicit semantic knowledge, can address the symbolic side. However, automatically deriving reliable KGs from text corpora has remained an open problem. We address these challenges by introducing GraphMERT, a tiny graphical encoder-only model that distills high-quality KGs from unstructured text corpora and its own internal representations. GraphMERT and its equivalent KG form a modular neurosymbolic stack: neural learning of abstractions; symbolic KGs for verifiable reasoning. GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines. Concretely, we target reliable domain-specific KGs that are both (1) factual (with provenance) and (2) valid (ontology-consistent relations with domain-appropriate semantics). When a large language model (LLM), e.g., Qwen3-32B, generates domain-specific KGs, it falls short on reliability due to prompt sensitivity, shallow domain expertise, and hallucinated relations. On text obtained from PubMed papers on diabetes, our 80M-parameter GraphMERT yields a KG with a 69.8% FActScore; a 32B-parameter baseline LLM yields a KG that achieves only 40.2% FActScore. The GraphMERT KG also attains a higher ValidityScore of 68.8%, versus 43.0% for the LLM baseline.

Paper Structure

This paper contains 105 sections, 11 equations, 14 figures, 24 tables, 1 algorithm.

Figures (14)

  • Figure 1: A toy KG example from the medical domain.
  • Figure 2: Overview of the GraphMERT framework. It augments syntactic data with semantic tails (I) and is trained on the fusion of syntactic and semantic examples (II); an LLM helps determine the linguistic structure of tails proposed by GraphMERT (III). (I): A chain graph (Ic) combines syntactic knowledge from text corpora (Ib) with semantic examples and relations from a seed KG (Ia): roots hold syntactic knowledge (in orange), sparse leaves hold semantic examples (in blue), and edges encode semantic relations (purple arrows). (II): GraphMERT is trained on chain graphs to align semantic examples with their syntactic context (IIa). It then predicts novel semantic token completions for chain graphs without injections, using their syntactic information as context (IIb). (III): An LLM combines raw semantic token completions from GraphMERT into grammatically well-formed triple tails, producing complete triples. After filtering them by similarity to the source syntactic context and dropping duplicate triples, we obtain the final KG.
  • Figure 3: Chain graph. Roots are in orange, leaves are in blue. Conceptual representation (A, B): term level, where each circle is a term. Actual representation in training (C): token level, where each square is a token. Each term can be multi-token. (A) No injections; all leaves are empty. (B) One root node has a leaf term. (C) Token-level representation for the 3-leaf case. Here, the leaf in (B) is encoded with a maximum of three tokens and padded to the maximum length if needed. The root term comprises two tokens, and the tail term also comprises two tokens that are connected to the first root token.
  • Figure 4: Main GraphMERT architectural components. GraphMERT is a RoBERTa transformer with two modifications, which encode graph relations and graph distance, respectively. (I) In the embedding layer, H-GAT encodes semantic triples. (IA) There are no leaves connected to a root node; hence, the node feature is equal to the token embedding. (IB) There are leaves connected to a root node; H-GAT fuses leaves, relations, and head embeddings, resulting in a fused node feature. (II) In the attention layers, attention weights are multiplied by a function that decreases exponentially with pairwise distance. The input is either a node feature or a fused node feature.
  • Figure 5: Semantic embedding derivation on leaves (only three leaves are shown). $h_i$: head token, $l_i$: leaf token, $t$: syntactic context token. For every injected triple, H-GAT fuses each leaf token with the relation and all the head tokens, yielding an embedding of the same dimension as the initial leaf token embedding. The derived embedding replaces the initial leaf embedding.
  • ...and 9 more figures
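
The captions for Figures 3 and 4 describe two mechanisms that are easier to see in code: a chain of root nodes with leaves attached, and attention weights that are scaled down exponentially with pairwise graph distance. The following is a minimal sketch under stated assumptions, not the authors' implementation. The node layout (every root owns a fixed number of leaf slots, each attached directly to that root), the base `decay`, the row renormalisation, and the function names `chain_graph_distances` and `decayed_attention` are all hypothetical choices made for illustration.

```python
import math


def chain_graph_distances(num_roots, leaves_per_root):
    """Hop distances in a toy leafy chain graph.

    Assumed layout (illustrative only): roots 0..num_roots-1 form a chain,
    and each root owns `leaves_per_root` leaf slots attached directly to it.
    Node ids list all roots first, then the leaves grouped by root.
    """
    n = num_roots * (1 + leaves_per_root)

    def root_of(node):
        # A root is its own anchor; a leaf maps back to the root that owns it.
        if node < num_roots:
            return node
        return (node - num_roots) // leaves_per_root

    def depth(node):
        # Roots sit on the chain (0 hops off it); leaves are 1 hop off it.
        return 0 if node < num_roots else 1

    dist = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                # Walk along the chain between the two anchors, plus one hop
                # for each endpoint that is a leaf.
                dist[a][b] = abs(root_of(a) - root_of(b)) + depth(a) + depth(b)
    return dist


def decayed_attention(scores, dist, decay=0.6):
    """Row-wise softmax of `scores`, scaled by decay**distance.

    Renormalising each row afterwards is an assumption of this sketch;
    the caption only states that the weights are multiplied by an
    exponentially decreasing function of pairwise distance.
    """
    weights = []
    for i, row in enumerate(scores):
        top = max(row)
        exps = [math.exp(s - top) for s in row]  # numerically stable softmax
        total = sum(exps)
        scaled = [(e / total) * decay ** dist[i][j] for j, e in enumerate(exps)]
        norm = sum(scaled)
        weights.append([w / norm for w in scaled])
    return weights


if __name__ == "__main__":
    dist = chain_graph_distances(num_roots=3, leaves_per_root=2)
    uniform = [[0.0] * len(dist) for _ in dist]
    for row in decayed_attention(uniform, dist, decay=0.5):
        print([round(w, 3) for w in row])
```

With uniform raw scores, every difference between the printed weights comes from graph distance alone: a node attends most to itself, then to its direct neighbours (adjacent roots and its own leaves), and least to leaves hanging off distant roots.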