Parametric Retrieval-Augmented Generation using Latent Routing of LoRA Adapters

Zhan Su, Fengran Mo, Jian-yun Nie

TL;DR

Poly-PRAG introduces latent routing to Parametric Retrieval-Augmented Generation (PRAG), replacing the one-LoRA-adapter-per-document scheme with a small set of latent adapters shared across the document collection. A routing mechanism selects a sparse combination of adapters per document, which enables offline multi-task encoding and eliminates per-query adapter loading, substantially reducing storage and inference overhead. Results on four knowledge-intensive QA datasets show state-of-the-art performance together with efficiency gains: storage reductions that exceed 100x in the best case and notable online speedups. The approach scales to large corpora and offers insight into efficient external knowledge injection for RAG systems.
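
To make the storage argument concrete, the following is a rough parameter count, not the paper's measurement. It assumes a single adapted weight matrix, one routing logit per (document, latent adapter) pair, and entirely hypothetical sizes; the function names are illustrative only.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def one_to_one_params(num_docs: int, d_in: int, d_out: int, rank: int) -> int:
    """One dedicated adapter per document."""
    return num_docs * lora_params(d_in, d_out, rank)


def latent_routing_params(
    num_docs: int, num_latent: int, d_in: int, d_out: int, rank: int
) -> int:
    """Shared latent adapters plus one routing logit per (document, adapter) pair."""
    return num_latent * lora_params(d_in, d_out, rank) + num_docs * num_latent


# Hypothetical configuration for a single adapted weight matrix.
cfg = dict(d_in=2048, d_out=2048, rank=2)
dedicated = one_to_one_params(num_docs=10_000, **cfg)
shared = latent_routing_params(num_docs=10_000, num_latent=8, **cfg)
print(f"one-to-one: {dedicated:,} params; latent routing: {shared:,} params")
print(f"ratio: {dedicated / shared:.1f}x")
```

Under these assumptions the dedicated scheme grows with the number of documents times the adapter size, while the shared scheme grows only with the routing table, which is why the gap widens as the corpus gets larger.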

Abstract

Parametric Retrieval-Augmented Generation (PRAG) is a novel RAG paradigm that integrates external knowledge directly into a Large Language Model (LLM) by parameterizing documents using LoRA adapters, demonstrating reduced inference costs compared to traditional RAG approaches. However, current PRAG approaches adopt a one-to-one document encoding scheme, using a dedicated LoRA adapter for each individual document. This scheme introduces two major limitations: First, it leads to data scarcity, as the training datasets for individual LoRA adapters are limited. Second, it incurs high overhead during inference, requiring the merging of LLM weights with a new LoRA adapter for every candidate passage, which is computationally inefficient. To overcome these challenges, we propose a novel paradigm for encoding passages in PRAG that utilizes a latent routing encoding process (Poly-PRAG). During offline encoding, we treat the encoding of a set of documents as a multi-task learning process, where each passage is assigned a unique task identifier. By employing a routing function, we use a small set of latent LoRA adapters to encode the entire passage space. During online inference, this routing function selectively activates a subset of latent experts based on the input query. We conduct comprehensive evaluations of Poly-PRAG across multiple knowledge-intensive NLP tasks. Our extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art results on four distinct datasets.
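
The routing idea described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the router here is a dense softmax over per-document logits (the paper's routing function may differ, and a sparse router would zero out most weights), and all sizes and names (`routing_weights`, `document_delta`, `adapted_forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
num_docs, num_latent, d_in, d_out, rank = 6, 3, 8, 8, 2

# Shared pool of latent LoRA factors: the k-th update is B[k] @ A[k].
A = rng.normal(size=(num_latent, rank, d_in))
B = rng.normal(size=(num_latent, d_out, rank))

# One row of routing logits per document; the task identifier indexes the row.
routing_logits = rng.normal(size=(num_docs, num_latent))


def routing_weights(doc_id: int) -> np.ndarray:
    """Softmax over latent adapters for one document (an assumed router)."""
    logits = routing_logits[doc_id]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def document_delta(doc_id: int) -> np.ndarray:
    """Weighted sum of the latent LoRA updates for one document."""
    w = routing_weights(doc_id)
    # sum_k w[k] * (B[k] @ A[k]), giving shape (d_out, d_in).
    return np.einsum("k,kor,kri->oi", w, B, A)


def adapted_forward(x: np.ndarray, W: np.ndarray, doc_id: int) -> np.ndarray:
    """Apply the frozen weight W plus the routed low-rank update."""
    return x @ (W + document_delta(doc_id)).T
```

In this sketch only `routing_logits` is document-specific; the factors `A` and `B` are shared by every document, so all documents contribute training signal to the same small pool of adapters.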

Paper Structure

This paper contains 28 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Inference comparison among different methods. (1) Standard RAG takes the plaintext documents and questions as input. (2) PRAG generates QA pairs per document to fine-tune LoRA adapters, and sums them to obtain document-aggregated representations for LLM injection. (3) DyPRAG trains a hypernetwork to translate each individual document into its LoRA adapter and averages the adapters to achieve document aggregation for LLM injection. (4) For Poly-PRAG, a set of LoRA adapters is created for a document collection before inference time, and adapters are then selected through a routing function.
  • Figure 2: Poly-PRAG. In the offline encoding process, instead of encoding each document with a one-to-one paradigm, Poly-PRAG encodes the whole set of documents with a shared set of latent LoRA adapters. At inference time, given a query, once the document is retrieved, the routing function selects which latent LoRA adapters are active during generation.
  • Figure 3: Effect of the number of latent experts on F1 score. The analysis uses Llama3.2-1B.
  • Figure 4: The figure presents the offline encoding time for three PRAG methods. The base language model is Llama3-1B. To isolate the performance difference, the LoRA rank is 2 across all methods. Note that the time reported for DyPRAG explicitly incorporates the training time for its translator module.
  • Figure 5: Online inference time comparison between PRAG and Poly-PRAG. We select the 2WikiMultihopQA task with Qwen2.5-1.5B as the base model. Experiments run on an NVIDIA A100 80GB GPU.
  • ...and 1 more figures