KBLaM: Knowledge Base augmented Language Model

Xi Wang, Taketomo Isazawa, Liana Mikaelyan, James Hensman

TL;DR

KBLaM addresses the challenge of augmenting pre-trained LLMs with external knowledge without fine-tuning or external retrieval at inference. It converts a knowledge base of triples into dense knowledge tokens via a pre-trained sentence encoder with linear adapters and injects them into the LLM through a rectangular attention mechanism, ensuring linear scaling with KB size and enabling dynamic updates. The approach yields interpretable attention patterns and end-to-end reasoning over large KBs (over 10K triples) on a single A100 GPU, with memory-efficient performance comparable to in-context learning on synthetic data and robust behavior on out-of-distribution data. Ablation studies highlight the roles of encoder choice, token frequency, and layer placement, and the work releases synthetic and real KBs to support further research in KB-based language understanding and long-context reasoning.
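To illustrate the knowledge-token step, here is a minimal sketch. It maps each (name, property, value) triple to a key string and a value string, embeds both with a sentence encoder, and projects them with two linear adapters into a key/value space. Several pieces are assumptions made only for illustration. The key/value sentence templates are hypothetical. `toy_encode` is a deterministic random-vector stand-in for a real pre-trained sentence encoder. The dimensions `ENC_DIM` and `KV_DIM` are arbitrary. The adapters `W_k` and `W_v` are randomly initialised here, whereas in KBLaM they are learned.

```python
import hashlib
import numpy as np

ENC_DIM = 16  # hypothetical sentence-encoder output size
KV_DIM = 8    # hypothetical key/value size of one attention layer

def toy_encode(text: str) -> np.ndarray:
    """Hypothetical stand-in for a pre-trained sentence encoder:
    a deterministic pseudo-random unit vector seeded by the text."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "little")
    vec = np.random.default_rng(seed).standard_normal(ENC_DIM)
    return vec / np.linalg.norm(vec)

def knowledge_tokens(triples, W_k, W_v):
    """Map M (name, property, value) triples to M key vectors and M value vectors."""
    # Assumed templates: the key describes what is being asked, the value is the answer.
    key_emb = np.stack([toy_encode(f"The {prop} of {name}") for name, prop, _ in triples])
    val_emb = np.stack([toy_encode(value) for _, _, value in triples])
    # Linear adapters project encoder space into the LLM's key/value space.
    return key_emb @ W_k, val_emb @ W_v

rng = np.random.default_rng(0)
W_k = rng.standard_normal((ENC_DIM, KV_DIM)) / np.sqrt(ENC_DIM)  # learned in practice
W_v = rng.standard_normal((ENC_DIM, KV_DIM)) / np.sqrt(ENC_DIM)  # learned in practice

kb = [
    ("Alice", "occupation", "software engineer"),
    ("Bob", "hometown", "Seattle"),
    ("Carol", "favorite sport", "tennis"),
]
keys, values = knowledge_tokens(kb, W_k, W_v)
print(keys.shape, values.shape)  # (3, 8) (3, 8)
```

In this sketch each triple is encoded independently. Adding, editing, or removing a triple therefore changes only its own row of `keys` and `values`, which is the property that allows a KB to be updated without retraining the model.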

Abstract

In this paper, we propose Knowledge Base augmented Language Model (KBLaM), a new method for augmenting Large Language Models (LLMs) with external knowledge. KBLaM works with a knowledge base (KB) constructed from a corpus of documents. It transforms each piece of knowledge in the KB into a continuous key-value vector pair via pre-trained sentence encoders with linear adapters, and it integrates these pairs into pre-trained LLMs via a specialized rectangular attention mechanism. Unlike Retrieval-Augmented Generation, KBLaM eliminates external retrieval modules, and unlike in-context learning, its computational overhead scales linearly with KB size rather than quadratically. Our approach enables integrating a large KB of more than 10K triples into an 8B pre-trained LLM with only an 8K context window on a single A100 80GB GPU, and it allows for dynamic updates without model fine-tuning or retraining. Experiments demonstrate KBLaM's effectiveness on various tasks, including question answering and open-ended reasoning, while providing interpretable insights into its use of the augmented knowledge. Code and datasets are available at https://github.com/microsoft/KBLaM/
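Below is a single-head NumPy sketch of the "rectangular attention" idea, under simplifying assumptions. The N prompt tokens attend to all M knowledge tokens and, causally, to earlier prompt tokens. The knowledge tokens issue no queries of their own. As a result, the score matrix has shape N × (M + N) rather than (M + N) × (M + N). Using one shared query vector for both KB and prompt keys is a simplification of this sketch. The function names `rectangular_attention` and `score_entries` are hypothetical, not the authors' code.

```python
import numpy as np

def rectangular_attention(q, k, v, kb_k, kb_v):
    """Single-head attention where N prompt queries see M KB tokens plus a causal prompt.

    q, k, v:    (N, d) prompt queries, keys, values
    kb_k, kb_v: (M, d) knowledge-token keys and values
    Returns the (N, d) outputs and the (N, M + N) attention weights.
    """
    n, d = q.shape
    m = kb_k.shape[0]
    keys = np.concatenate([kb_k, k], axis=0)     # (M + N, d)
    vals = np.concatenate([kb_v, v], axis=0)     # (M + N, d)
    scores = q @ keys.T / np.sqrt(d)             # (N, M + N): the "rectangle"
    # Every prompt token may see every KB token; prompt-to-prompt attention is causal.
    causal = np.tril(np.ones((n, n), dtype=bool))
    mask = np.concatenate([np.ones((n, m), dtype=bool), causal], axis=1)
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ vals, weights

def score_entries(n_prompt, m_triples, tokens_per_triple):
    """Count attention scores: KB as knowledge tokens vs. KB pasted into the context."""
    rectangular = n_prompt * (m_triples + n_prompt)
    in_context = (tokens_per_triple * m_triples + n_prompt) ** 2
    return rectangular, in_context

rng = np.random.default_rng(0)
N, M, D = 4, 6, 8
q, k, v = (rng.standard_normal((N, D)) for _ in range(3))
kb_k, kb_v = rng.standard_normal((M, D)), rng.standard_normal((M, D))
out, w = rectangular_attention(q, k, v, kb_k, kb_v)
print(out.shape, w.shape)              # (4, 8) (4, 10)
print(score_entries(100, 10_000, 20))  # (1010000, 40040010000)
```

Knowledge tokens never attend to one another, so the score matrix grows as N·(M + N), which is linear in M. Placing the same KB in context instead grows as (KM + N)², matching the contrast drawn in Figure 3. With the hypothetical sizes above (100 prompt tokens, 10K triples, 20 tokens per triple), this is about 10^6 scores versus about 4 × 10^10.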

Paper Structure

This paper contains 47 sections, 12 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Overview of the KBLaM pipeline and comparison with existing approaches. KBLaM injects knowledge into a pre-trained LLM in the form of knowledge tokens, a set of continuous key-value vectors, using a modified rectangular attention structure. Unlike RAG, KBLaM does not rely on a separate retriever module at inference time. Unlike in-context learning, KBLaM's computation and memory overhead scales linearly rather than quadratically with the size of the KB.
  • Figure 2: Overview of KBLaM's KB augmentation process
  • Figure 3: Memory overhead of different methods. For a KB of $M$ triples, each $K$ tokens long on average, in-context learning's memory scales with $(KM)^2$, whereas KBLaM's memory scales with $M$.
  • Figure 4: KBLaM's attention matrix is interpretable. For a toy KB of 10 triples, the heatmap shows the attention weights under different questions. We select the 15th attention layer and visualize the post-softmax attention scores averaged over all attention heads. The x-axis shows only the keys corresponding to the KB part, and the y-axis is aligned by the end of each question's query.
  • Figure 5: Through instruction tuning, the attention exhibits retrieval behavior. For simple Q&A on the validation set of the synthetic data (solid line) and on the Enron dataset (OOD, dashed line), we use the attention score at the 15th layer, averaged over all attention heads, as a classification score for each triple and measure the top-1 and top-5 accuracy. KBLaM assigns the highest attention score to the truly relevant triple most of the time; performance degrades on OOD data but remains reasonable.
  • ...and 7 more figures
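Figure 5 treats attention as a retriever. For each question, the attention weights over the KB part are averaged over heads to give one score per triple. Under top-k accuracy, the question counts as correct if its gold triple is among the k highest-scoring triples. The sketch below computes that metric from an array of attention weights. It assumes a (questions × heads × triples) layout. The function name `topk_retrieval_accuracy` and the toy numbers are hypothetical, not the authors' evaluation code or data.

```python
import numpy as np

def topk_retrieval_accuracy(attn, gold, k):
    """Top-k accuracy of attention-as-retrieval.

    attn: (Q, H, M) attention weights from Q questions, H heads, to M KB triples
    gold: (Q,) index of the truly relevant triple for each question
    """
    scores = attn.mean(axis=1)                  # average over heads -> (Q, M)
    topk = np.argsort(-scores, axis=1)[:, :k]   # indices of the k highest-scoring triples
    hits = (topk == np.asarray(gold)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 questions, 2 heads, 6 triples.
attn = np.array([
    [[0.1, 0.6, 0.1, 0.1, 0.05, 0.05], [0.2, 0.5, 0.1, 0.1, 0.05, 0.05]],
    [[0.3, 0.1, 0.2, 0.2, 0.1, 0.1], [0.3, 0.1, 0.3, 0.1, 0.1, 0.1]],
    [[0.1, 0.1, 0.1, 0.1, 0.1, 0.5], [0.2, 0.1, 0.1, 0.1, 0.1, 0.4]],
])
gold = [1, 2, 5]
print(topk_retrieval_accuracy(attn, gold, k=1))  # 2 of 3 questions rank the gold triple first
print(topk_retrieval_accuracy(attn, gold, k=5))  # 1.0
```

In the toy data, the second question places slightly more attention on triple 0 than on its gold triple 2. It misses at top-1 but is recovered at top-5, which is why the paper reports both cut-offs.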