K-ON: Stacking Knowledge On the Head Layer of Large Language Model

Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen

TL;DR

The paper tackles the granularity mismatch between knowledge graphs and token-level LLM predictions. It introduces K-ON, which stacks knowledge on the head layer of LLMs by using $K$ head modules to predict the target entity in a single pass and trains with an entity-level contrastive loss. Key components include Head MLPs, conditional attention, LoRA score layers, $K$-step gathering, and Head Trajectory Tuning to align the $K$-step predictions with the original single-step distribution. Experiments on DB15K and MKGW show K-ON outperforms state-of-the-art KG completion methods, including multi-modal baselines, while offering improved training efficiency.

Abstract

Recent advancements in large language models (LLMs) have significantly improved various natural language processing (NLP) tasks. Typically, LLMs are trained to predict the next token, which aligns well with many NLP tasks. However, in knowledge graph (KG) scenarios, entities are the fundamental units, and identifying an entity requires at least several tokens. This leads to a granularity mismatch between KGs and natural language. To address this issue, we propose K-ON, which integrates KG knowledge into the LLM by employing multiple head layers for next k-step prediction. K-ON not only generates entity-level results in one step, but also enables a contrastive loss against entities, which is the most powerful tool in KG representation learning. Experimental results show that K-ON outperforms state-of-the-art methods that incorporate text and even other modalities.

Paper Structure

This paper contains 28 sections, 13 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: A comparison of single-step prediction and the proposed K-ON prediction. Left: in the conventional single-step prediction, obtaining an output of an entity necessitates recurrently feeding input data and cannot be parallelized across different entities. Right: the K-ON prediction generates an entity in a single step and allows for parallelization across multiple entities, thereby enabling entity-level contrastive learning.
  • Figure 2: Overview of the K-ON architecture. From left to right: (1) The LLM processes the input text containing incomplete triplet information; (2) The resulting hidden states are input to distinct head MLPs within K-ON; (3) A compact conditional Transformer refines the corresponding outputs to capture sequential dependencies; (4) LoRA score layers are employed to transform the hidden states into $K$ probability distribution estimations; (5-6) Aggregating the elements from the respective probability vectors, K-ON computes the probabilities of all candidate entities simultaneously.
  • Figure 3: Performance of K-ON w.r.t. the number of K-ON head layers $k$. The results are obtained using $8$ A100 GPUs.
  • Figure 4: Performance of K-ON w.r.t. the number of negative entities. The results are obtained using $8$ A100 GPUs.
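The gathering step in Figure 2 (5-6) and the entity-level contrastive loss can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `pad_id` convention, the toy inputs, and the choice to aggregate by summing per-step log-probabilities are all assumptions made for illustration, since the text above only says the elements are aggregated.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def entity_scores(step_logits, entity_token_ids, pad_id=-1):
    """Score every candidate entity from K per-step vocabulary distributions.

    step_logits: K lists of vocabulary logits, one per head (hypothetical input).
    entity_token_ids: for each entity, K token ids, padded with pad_id.
    Assumption: an entity's score is the sum of its gathered token log-probs.
    """
    step_logp = [log_softmax(logits) for logits in step_logits]
    scores = []
    for token_ids in entity_token_ids:
        score = 0.0
        for k, tok in enumerate(token_ids):
            if tok != pad_id:  # skip padding positions
                score += step_logp[k][tok]
        scores.append(score)
    return scores

def contrastive_loss(scores, target_index):
    """Cross-entropy over candidate entities: target vs. negatives."""
    return -log_softmax(scores)[target_index]

# Toy example: vocabulary of 4 tokens, K = 2 heads, 3 candidate entities.
step_logits = [[2.0, 0.0, 0.0, 0.0],   # head 1 favors token 0
               [0.0, 2.0, 0.0, 0.0]]   # head 2 favors token 1
entities = [[0, 1], [2, 3], [0, 3]]    # entity 0 is the target
scores = entity_scores(step_logits, entities)
loss = contrastive_loss(scores, target_index=0)
```

Because all candidates are scored from the same $K$ distributions, no per-entity forward pass is needed, which is what makes it practical to contrast the target against many negative entities. Note that summing log-probabilities favors entities with fewer tokens, so a real system would likely need some form of length handling.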