Parametric Retrieval Augmented Generation

Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, Yiqun Liu

TL;DR

Parametric RAG introduces a paradigm shift: it embeds external knowledge into LLM parameters rather than supplying it as prompt-time context. It parameterizes documents offline through document augmentation and LoRA-style parametric encoding, then follows a Retrieve-Update-Generate online workflow to fuse the retrieved knowledge into the model’s FFN. Empirical results on multiple benchmarks show that Parametric RAG often surpasses in-context RAG baselines in effectiveness while reducing online inference cost, with additive gains when combined with in-context approaches. The work discusses trade-offs, including offline preprocessing and storage costs, and points to future directions such as model-agnostic representations and broader applications beyond RAG.

Abstract

Retrieval-augmented generation (RAG) techniques have emerged as a promising solution to enhance the reliability of large language models (LLMs) by addressing issues like hallucinations, outdated knowledge, and domain adaptation. In particular, existing RAG methods append relevant documents retrieved from an external corpus or database to the input of LLMs to guide their generation process, which we refer to as the in-context knowledge injection method. While this approach is simple and often effective, it has inherent limitations. First, increasing the context length and the number of relevant documents can lead to higher computational overhead and degraded performance, especially in complex reasoning tasks. More importantly, in-context knowledge injection operates primarily at the input level, whereas LLMs store their internal knowledge in their parameters. This gap fundamentally limits the capacity of in-context methods. To this end, we introduce Parametric Retrieval-Augmented Generation (Parametric RAG), a new RAG paradigm that integrates external knowledge directly into the parameters of the feed-forward networks (FFN) of an LLM through document parameterization. This approach not only saves online computational costs by eliminating the need to inject multiple documents into the LLM's input context, but also deepens the integration of external knowledge into the parametric knowledge space of the LLM. Experimental results demonstrate that Parametric RAG substantially enhances both the effectiveness and efficiency of knowledge augmentation in LLMs. It can also be combined with in-context RAG methods to achieve even better performance. We have open-sourced all the code, data, and models at the following anonymized GitHub link: https://github.com/oneal2000/PRAG

Paper Structure

This paper contains 28 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: A comparison of the in-context RAG paradigm and our proposed Parametric RAG paradigm. In-context RAG combines the tokens of relevant documents and the query in the input, using the original LLM $\theta$ to answer the question without modifying its parameters. Our proposed Parametric RAG updates the LLM's parameters $\theta^{\prime} = \theta + \Delta \theta$ based on the retrieved documents, temporarily integrating relevant knowledge into the LLM's parameters to answer the question.
  • Figure 2: An illustration of how we parameterize each document $d_i$ in the corpus during the Offline Document Parameterization stage.
  • Figure 3: Ablation study on the impact of the document augmentation stage. LLaMA indicates LLaMA-3.2-1B, and Qwen indicates Qwen-2.5-1.5B. The metric used is the F1 Score.
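
The update $\theta^{\prime} = \theta + \Delta \theta$ described in the Figure 1 caption can be illustrated with a small sketch. The text above says only that documents are encoded in a LoRA-style form and fused into the FFN, so the details here are assumptions made for illustration: each retrieved document is taken to contribute a low-rank pair (A_i, B_i), the deltas are merged by a plain scaled sum, and the names `merge_document_deltas` and `alpha` are hypothetical rather than taken from the paper's code.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
        for row in a
    ]


def add_scaled(a: Matrix, b: Matrix, scale: float) -> Matrix:
    """Return a + scale * b as a new matrix."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def merge_document_deltas(
    w: Matrix, doc_factors: List[Tuple[Matrix, Matrix]], alpha: float = 1.0
) -> Matrix:
    """Compute W' = W + alpha * sum_i (A_i @ B_i) without modifying W.

    Assumption: a simple scaled sum is used as the merge rule.
    """
    updated = [row[:] for row in w]  # copy, so the base weights stay intact
    for a_i, b_i in doc_factors:
        updated = add_scaled(updated, matmul(a_i, b_i), alpha)
    return updated


# Toy 2x2 FFN weight and two rank-1 "document" deltas (made-up values).
W = [[1.0, 0.0], [0.0, 1.0]]
doc1 = ([[1.0], [0.0]], [[0.5, 0.0]])   # A: 2x1, B: 1x2
doc2 = ([[0.0], [1.0]], [[0.0, 0.25]])  # A: 2x1, B: 1x2

W_prime = merge_document_deltas(W, [doc1, doc2])
print(W_prime)  # [[1.5, 0.0], [0.0, 1.25]]
```

Because the function returns a fresh matrix, the base weights are untouched and the next query can start again from the original $\theta$, which matches the "temporarily integrating" wording of the caption.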