ConceptFormer: Towards Efficient Use of Knowledge-Graph Embeddings in Large Language Models

Joel Barmettler, Abraham Bernstein, Luca Rossetto

TL;DR

ConceptFormer is a vector-based KG integration method that injects compact concept vectors into the LLM input embedding space to improve factual recall without altering model weights. The paper introduces Tri-REx, T-REx Bite, and T-REx Star to evaluate one-hop knowledge recall and KG-based prompt augmentation, and reports substantial gains over both unaugmented baselines and graph-text RAG at a far lower token cost. On GPT-2 0.1B, concept vectors raise Hit@10 by up to 348% on synthetic sentences and up to 272% on Wikipedia sentences; even a single concept vector yields gains of up to 213% on Wikipedia sentences while consuming roughly 130x fewer input tokens than graph textification. The work offers a practical, scalable pathway for knowledge-grounded generation in IR pipelines and is adaptable to dynamic KG updates and larger LMs.

Abstract

Retrieval Augmented Generation (RAG) has enjoyed increased attention in the recent past and recent advancements in Large Language Models (LLMs) have highlighted the importance of integrating world knowledge into these systems. Current RAG methodologies often modify the internal architecture of pre-trained language models (PLMs) or rely on textifying knowledge graphs (KGs), which is inefficient in terms of token usage. This paper introduces ConceptFormer, a new approach to augment LLMs with structured knowledge from KGs, such as Wikidata, without altering their internal structure or relying on textual input of KGs. ConceptFormer operates in the LLM embedding vector space, creating and injecting concept vectors that encapsulate the information of the KG nodes directly. Trained in conjunction with a frozen LLM, ConceptFormer generates a comprehensive lookup table that maps KG nodes to their respective concept vectors. The approach aims to enhance the factual recall capabilities of LLMs by enabling them to process these concept vectors natively, thus enriching them with structured world knowledge in an efficient and scalable manner. Our experiments demonstrate that the addition of concept vectors to GPT-2 0.1B substantially increases its factual recall ability (Hit@10) by up to 272% when tested on sentences from Wikipedia and up to 348% on synthetically generated sentences. Even injecting only a single concept vector into the prompt increases factual recall ability (Hit@10) by up to 213% on Wikipedia sentences, significantly outperforming RAG with graph textification while consuming 130x fewer input tokens.
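
To make the lookup-table idea concrete, the following is a minimal sketch of how precomputed concept vectors might be spliced into a prompt's embedding sequence. It is not the authors' implementation. The function name `inject_concept_vectors`, the table layout, and the choice to insert the vectors directly after the entity mention are all assumptions made for illustration; the abstract only states that concept vectors are injected into the LLM embedding space.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def inject_concept_vectors(
    token_embeddings: Sequence[Vector],
    entity_span_end: int,
    entity_id: str,
    concept_table: Dict[str, List[Vector]],
) -> List[Vector]:
    """Return a new embedding sequence with the entity's concept vectors spliced in.

    Assumption: the vectors are placed right after the last token of the
    entity mention (index `entity_span_end`, exclusive end of the span).
    """
    # An entity missing from the table leaves the prompt unchanged.
    concept_vectors = concept_table.get(entity_id, [])
    dim = len(token_embeddings[0])
    for vec in concept_vectors:
        if len(vec) != dim:
            raise ValueError("concept vector does not match the embedding width")
    head = list(token_embeddings[:entity_span_end])
    tail = list(token_embeddings[entity_span_end:])
    return head + [list(v) for v in concept_vectors] + tail


# Toy example: a 4-token prompt with 3-dimensional embeddings.
prompt = [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.3], [0.4, 0.4, 0.4]]
# Hypothetical lookup table keyed by Wikidata ID (Q937 is Albert Einstein).
table = {"Q937": [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]}
augmented = inject_concept_vectors(
    prompt, entity_span_end=2, entity_id="Q937", concept_table=table
)
print(len(prompt), "->", len(augmented))  # 4 -> 6
```

Because the table is built once and the LLM stays frozen, augmentation at inference time reduces to a dictionary lookup plus a list splice, and each entity costs only as many input positions as it has concept vectors.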

Paper Structure

This paper contains 25 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: ConceptFormer enhances a prompt by extending the embedding vectors of the original prompt with learned concept vectors. Entity recognition and entity linking are used to detect an entity, "Albert Einstein" (displayed in blue), in the original prompt and link it to a large KG such as Wikidata. ConceptFormer creates a vector embedding for the detected entity that is compatible with the LLM's input embedding space. The resulting concept vector (displayed in green) is shown to capture the essence of the entity far better than the original token embedding vectors alone, leading to more knowledgeable output generated by the LLM (displayed in red).
  • Figure 2: Example datapoint from the Tri-REx (Synthetic) Dataset. The datapoint consists of the main sentence(s), information about the mentioned Wikidata triple, and boundary indications of the entity label locations within the sentence(s).
  • Figure 3: The input to ConceptFormer consists of three matrices, representing the central node, the neighbouring nodes, and the connecting edges. These embeddings can be generated with numerous text-embedding mechanisms. In our work, we generated the node and edge embeddings by simply forwarding their labels through an LLM and averaging the last hidden layer. ConceptFormer trains multiple parallel, fully independent concept vector generator blocks, each implementing an attention mechanism in which the central node becomes the query Q, the concatenated neighbouring nodes and corresponding edges become the key K, and the neighbouring nodes become the values V. Finally, a shared dense network transforms the output of each concept vector generator block into the input embedding space of the LLM.
  • Figure 4: Hit@10 rate of various base models, with or without graph RAG (G-RAG), compared to GPT-2 0.1B with different ConceptFormers (CF), after pre-training on Tri-REx.
  • Figure 5: Hit@10 rate of GPT-2 0.1B + various ConceptFormers on the Wikipedia-based T-REx Bite Dataset.
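
The generator block described in the Figure 3 caption can be sketched in a few lines of dependency-free Python. This is an illustrative reading of the caption, not the paper's code: the learned projection matrices (`w_q`, `w_k`, `w_v`), the scaled dot-product scoring, the single linear layer standing in for the "shared dense network", and all dimension names are assumptions. Weights are random here, whereas in the paper they are trained against a frozen LLM.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix with a vector of length cols."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ConceptVectorBlock:
    """One independent generator block: single-head attention over the neighbourhood."""

    def __init__(self, d_graph: int, d_attn: int, rng: random.Random):
        self.d_attn = d_attn
        self.w_q = random_matrix(d_attn, d_graph, rng)      # query: central node
        self.w_k = random_matrix(d_attn, 2 * d_graph, rng)  # key: [neighbour ; edge]
        self.w_v = random_matrix(d_attn, d_graph, rng)      # value: neighbour only

    def __call__(self, central: Vector, neighbours: Matrix, edges: Matrix) -> Vector:
        q = matvec(self.w_q, central)
        # List concatenation n + e builds the [neighbour ; edge] key input.
        keys = [matvec(self.w_k, n + e) for n, e in zip(neighbours, edges)]
        values = [matvec(self.w_v, n) for n in neighbours]
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(self.d_attn) for k in keys
        ]
        weights = softmax(scores)
        # Attention output: weighted sum of the value vectors.
        return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(self.d_attn)]


def concept_vectors(central, neighbours, edges, blocks, w_out: Matrix) -> Matrix:
    """Run every block, then map each output through the shared output layer."""
    return [matvec(w_out, block(central, neighbours, edges)) for block in blocks]


rng = random.Random(0)
d_graph, d_attn, d_llm, n_blocks = 4, 3, 5, 2
blocks = [ConceptVectorBlock(d_graph, d_attn, rng) for _ in range(n_blocks)]
w_out = random_matrix(d_llm, d_attn, rng)  # shared across all blocks
central = [rng.random() for _ in range(d_graph)]
neighbours = [[rng.random() for _ in range(d_graph)] for _ in range(3)]
edges = [[rng.random() for _ in range(d_graph)] for _ in range(3)]
vectors = concept_vectors(central, neighbours, edges, blocks, w_out)
print(len(vectors), len(vectors[0]))  # 2 5
```

In this reading, each block produces one concept vector, so the number of blocks determines how many input positions an entity occupies in the prompt. Keeping the output layer shared means only the attention weights differ between blocks.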