Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs

Siyue Su, Jian Yang, Bo Li, Guanglin Niu

TL;DR

KGT is a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. As its first step, it introduces specialized tokenization to construct feature representations at the level of those entity tokens.

Abstract

Leveraging Large Language Models (LLMs) for Knowledge Graph Completion (KGC) is promising but hindered by a fundamental granularity mismatch. LLMs operate on fragmented token sequences, whereas entities are the fundamental units in knowledge graph (KG) scenarios. Existing approaches typically either constrain predictions to limited candidate sets or align entities with the LLM's vocabulary by pooling multiple tokens or decomposing entities into fixed-length token sequences; these strategies fail to capture both the semantic meaning of the text and the structural integrity of the graph. To address this, we propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. Specifically, we first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism, avoiding training from scratch. Finally, we implement decoupled prediction by leveraging independent heads to separate and combine semantic and structural reasoning. Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
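
To make the fusion step in the abstract more concrete, here is a minimal sketch of what a relation-guided gate could look like. It is an illustration under stated assumptions, not the authors' implementation: the function name `gated_fusion`, the gate parameterisation `W_gate`, the convex-combination form, and the toy dimensions are all hypothetical, and the structural, textual, and relation features are assumed to have already been projected to a common dimension.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_fusion(struct_feat, text_feat, rel_feat, W_gate):
    """Blend structural and textual features with a relation-conditioned gate.

    Hypothetical form: gate = sigmoid(W_gate @ [struct; text; rel]), and the
    fused vector is gate * struct + (1 - gate) * text, element-wise.
    W_gate is assumed to have shape d x 3d.
    """
    gate_input = struct_feat + text_feat + rel_feat  # concatenation, length 3d
    gate = [sigmoid(z) for z in matvec(W_gate, gate_input)]
    return [g * s + (1.0 - g) * t
            for g, s, t in zip(gate, struct_feat, text_feat)]


# Toy data standing in for pre-trained features (assumed, not from the paper).
random.seed(0)
d = 4
struct_feat = [random.uniform(-1, 1) for _ in range(d)]
text_feat = [random.uniform(-1, 1) for _ in range(d)]
rel_feat = [random.uniform(-1, 1) for _ in range(d)]
W_gate = [[random.uniform(-0.5, 0.5) for _ in range(3 * d)] for _ in range(d)]

fused = gated_fusion(struct_feat, text_feat, rel_feat, W_gate)
```

In this sketch the relation feature only influences how much of each source is kept, so every fused coordinate stays between the corresponding structural and textual values. The fused vector would then serve as the embedding of the dedicated entity token.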

Paper Structure (33 sections, 11 equations, 7 figures, 7 tables)

This paper contains 33 sections, 11 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: An illustration of the two existing strategies used by full-space LLM-based methods, compared with KGT. (a) Pooling multiple tokens into a unified representation for each entity; (b) Decomposing entities into fixed-length sub-word sequences; (c) Constructing feature representations directly at the indivisible entity level.
  • Figure 2: Overview of the KGT framework. Part 1 illustrates the overall pipeline of KGT. The tokenizer first processes the input text containing the incomplete triple query, where entities and relations are represented as special tokens added to the original vocabulary. These special tokens obtain their embeddings via the Dual-Stream Specialized Token Embedding module. Subsequently, the LLM Backbone encodes the sequence, extracting the feature of the last token, which is then fed into the Dual-View Decoupled Predictor to generate the probability distribution over the entire entity vocabulary. Part 2 details the implementation of the Dual-Stream Specialized Token Embedding, where the dashed line indicates the assignment of the fused specialized feature to the special token representing the head entity h. Part 3 depicts the detailed architecture of the Dual-View Decoupled Predictor.
  • Figure 3: A comprehensive comparison between several variants of KGT on DB15K.
  • Figure 4: Trainable parameters of some LLM-based KGC methods based on MKG-W.
  • Figure 5: Results of different logits scaling.
  • ...and 2 more figures
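
The Figure 2 caption describes feeding the last-token feature into a Dual-View Decoupled Predictor that outputs a probability distribution over the entire entity vocabulary. The sketch below shows one way two independent heads could be combined. It is an assumption-laden illustration: the name `dual_view_predict`, the linear heads `W_sem` and `W_str`, the mixing weight `alpha`, and the toy numbers are hypothetical and are not taken from the paper.

```python
import math


def linear(W, v):
    # One logit per entity: each row of W scores the hidden state.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dual_view_predict(hidden, W_sem, W_str, alpha=0.5):
    """Score every entity with two separate heads, then combine.

    Hypothetical combination: a weighted sum of semantic and structural
    logits followed by a softmax over the full entity vocabulary.
    """
    sem_logits = linear(W_sem, hidden)  # semantic view
    str_logits = linear(W_str, hidden)  # structural view
    combined = [alpha * a + (1.0 - alpha) * b
                for a, b in zip(sem_logits, str_logits)]
    return softmax(combined)


# Toy last-token feature and head weights for a 4-entity vocabulary (assumed).
hidden = [0.2, -0.1, 0.4]
W_sem = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
W_str = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-0.5, 0.5, 0.5]]

probs = dual_view_predict(hidden, W_sem, W_str)
best_entity = max(range(len(probs)), key=probs.__getitem__)
```

Because each head scores all entities directly, prediction covers the full entity space in a single forward pass rather than ranking a pre-selected candidate set.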