
SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention

Bohan Yu, Wei Huang, Kang Liu

TL;DR

SR-KI tackles real-time knowledge integration into LLMs by encoding KBs as key-value pairs and injecting them into the model's KV cache. A two-stage training procedure first identifies a dedicated retrieval layer and then applies a supervised attention loss at that layer to focus attention on relevant KB entries, enabling end-to-end inference without external retrievers. The approach scales to 40K KBs on a single A100 GPU, maintains high retrieval recall (over 95% Recall@100) and strong QA performance, and achieves up to 99.75% KB compression with source attribution via Reference ID KBs. The result is a practical, interpretable, and scalable solution for dynamic knowledge updates in LLMs that addresses limitations of RAG and prior KV-projection methods.

Abstract

This paper proposes SR-KI, a novel approach for integrating real-time and large-scale structured knowledge bases (KBs) into large language models (LLMs). SR-KI begins by encoding KBs into key-value pairs using a pretrained encoder, and injects them into the LLM's KV cache. Building on this representation, we employ a two-stage training paradigm: first locating a dedicated retrieval layer within the LLM, and then applying an attention-based loss at this layer to explicitly supervise attention toward relevant KB entries. Unlike traditional retrieval-augmented generation methods that rely heavily on the performance of external retrievers and multi-stage pipelines, SR-KI supports end-to-end inference by performing retrieval entirely within the model's latent space. This design enables efficient compression of injected knowledge and facilitates dynamic knowledge updates. Comprehensive experiments demonstrate that SR-KI enables the integration of up to 40K KBs into a 7B LLM on a single A100 40GB GPU, and achieves strong retrieval performance, maintaining over 98% Recall@10 on the best-performing task and exceeding 88% on average across all tasks. Task performance on question answering and KB ID generation also demonstrates that SR-KI maintains strong performance while achieving up to 99.75% compression of the injected KBs.
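
The attention-based loss mentioned in the abstract can be illustrated with a small sketch. The exact form of the loss is not given in this summary, so the version below is an assumption: it takes the negative log of the attention mass that a query places on the relevant KB entries at the retrieval layer and adds it to the language modeling loss with a weight. The names `supervised_attention_loss`, `total_loss`, and `weight` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def supervised_attention_loss(kb_scores, relevant_ids, eps=1e-12):
    """Assumed loss form: -log of the attention mass on relevant KB entries.

    kb_scores:    raw attention scores of one query over all KB entries
                  at the retrieval layer.
    relevant_ids: indices of the KB entries that should be attended to.
    """
    attn = softmax(kb_scores)
    mass = sum(attn[i] for i in relevant_ids)
    return -math.log(mass + eps)


def total_loss(lm_loss, attn_loss, weight=1.0):
    """Combine the language modeling loss with the attention loss.

    The additive combination and the `weight` parameter are assumptions.
    """
    return lm_loss + weight * attn_loss


# Entry 0 is the relevant KB entry in both cases.
scores_good = [5.0, 0.1, -0.3, 0.2]  # attention already concentrated on entry 0
scores_bad = [0.1, 5.0, -0.3, 0.2]   # attention concentrated on a wrong entry
loss_good = supervised_attention_loss(scores_good, [0])
loss_bad = supervised_attention_loss(scores_bad, [0])
```

Under this assumed form, the loss is small when attention already concentrates on the correct entry and large when it concentrates elsewhere, which is the behavior a supervised attention signal needs in order to steer the retrieval layer.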

Paper Structure

This paper contains 46 sections, 6 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Illustration of the SR-KI training process with supervised attention. We incorporate the attention-based loss, computed using $A_{\text{KB}}^l$ from the retrieval layer, into the overall language modeling loss.
  • Figure 2: Illustration of the inference process: SR-KI selects top-$k$ KBs individually before the retrieval layer and reuses their indices for injection in later layers.
  • Figure 3: Left: reference ID accuracy and exact match (EM) accuracy for object without supervised attention training, using correct KB injected at a single layer. Middle: peak GPU memory usage of In-context Learning, KBLaM, and SR-KI under varying KB sizes (40GB limit shown). Right: results of unanswerable QA accuracy.
  • Figure 4: Task-specific KB attention weights at the retrieval layer: for Single-entity QA, correct KBs are placed at indices 0–1; for Multi-entity QA, at indices 0–3 for clarity. In Unanswerable QA, attention is spread across all entries.
  • Figure 5: Key-value representations of factual knowledge and reference ID KBs.
  • ...and 4 more figures
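
The inference procedure described in the Figure 2 caption can be sketched as follows. This is one reading of the caption, not the authors' implementation: each layer up to and including the retrieval layer picks its own top-$k$ KB entries from its attention scores, and every later layer reuses the indices chosen at the retrieval layer. The function names and the toy scores are hypothetical.

```python
def top_k_indices(scores, k):
    """Return the indices of the k highest scores, in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def select_kb_per_layer(layer_scores, retrieval_layer, k):
    """Choose which KB entries to inject at each layer.

    layer_scores:    one list of KB attention scores per layer.
    retrieval_layer: index of the layer whose selection is reused afterwards.
    Layers up to the retrieval layer select top-k individually (assumed
    reading of the caption); later layers reuse the retrieval layer's indices.
    """
    selected = []
    frozen = None
    for layer, scores in enumerate(layer_scores):
        if layer <= retrieval_layer:
            idx = top_k_indices(scores, k)
            if layer == retrieval_layer:
                frozen = idx
        else:
            idx = frozen
        selected.append(idx)
    return selected


# Toy example: 3 layers, 4 KB entries, retrieval layer is layer 1.
layer_scores = [
    [0.1, 0.9, 0.3, 0.2],
    [0.8, 0.1, 0.7, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
per_layer = select_kb_per_layer(layer_scores, retrieval_layer=1, k=2)
```

In the toy example the last layer would prefer entry 3 on its own scores, but it reuses the retrieval layer's selection instead. As illustrative arithmetic only: keeping 100 of 40,000 entries retains 0.25% of them and discards 99.75%; whether the reported compression figure is computed this way is not stated in this summary.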