Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization

Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng

TL;DR

SR-RAG addresses the gap where selective retrieval underutilizes the LLM's internal knowledge by introducing knowledge verbalization as a core component. It reframes retrieval as knowledge source selection between external retrieval and internal verbalization, and trains the model with a two-stage objective that couples source selection, verbalization, and answer generation, aided by a nearest-neighbor policy for robust inference under domain shifts. Empirically, SR-RAG outperforms always-retrieve and prior selective retrieval baselines across multiple LLMs and four knowledge-intensive tasks, while reducing retrieval calls by up to ~40% and achieving notable accuracy gains (e.g., up to ~19% depending on model). The approach offers scalable, efficient, and knowledge-aware RAG, with broad applicability to multi-source routing and cost-aware inference in real-world systems.

Abstract

Selective retrieval improves the accuracy and efficiency of retrieval-augmented generation (RAG) by reducing distractions from low-quality retrievals. However, existing approaches underutilize the inherent knowledge of large language models (LLMs), leading to suboptimal retrieval decisions and degraded generation performance. To bridge this gap, we propose Self-Routing RAG (SR-RAG), a novel framework that binds selective retrieval with knowledge verbalization. SR-RAG enables an LLM to dynamically decide whether to retrieve external knowledge or verbalize its own parametric knowledge. To this end, we design a multi-task objective that jointly optimizes an LLM for knowledge source selection, knowledge verbalization, and response generation. SR-RAG further incorporates a nearest neighbor search mechanism at inference time to improve the accuracy of knowledge source decisions under domain shifts. Fine-tuning three LLMs with SR-RAG significantly improves their response accuracy and reduces inference latency. Compared to the strongest selective retrieval baseline, SR-RAG reduces the number of retrievals by 29% while improving performance by 5.1%.

Paper Structure

This paper contains 54 sections, 5 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: An overview of SR-RAG. Given a user query, the system first selects the most appropriate knowledge source by combining special token prediction with nearest neighbor search. Then, the knowledge is either retrieved from an external source or self-verbalized by the LLM. Finally, the LLM forms the response based on the query and the knowledge. All the steps are streamlined into a single left-to-right generation pass.
  • Figure 2: Compared to traditional selective RAG, SR-RAG enables an LLM to self-route between knowledge sources and self-act as a knowledge source. We use blue to represent external information and red to represent the LLM and its self-generated tokens.
  • Figure 3: Knowledge verbalization significantly affects the LLM's ability boundary. For a large number of instances (16.4% to 38.8%, orange), GenRead reverses the knowledge source preference: without considering GenRead, RAG dominates over parametric knowledge.
  • Figure 4: Accuracy and system latency of SR-RAG fine-tuned Llama-2-7B-Chat with different verbalization frequencies. SR-RAG's source selection policy (marked with stars) achieves near-optimal accuracy-efficiency trade-off without dataset-specific thresholds.
  • Figure 5: Prompt used for knowledge verbalization data collection via GenRead.
  • ...and 5 more figures
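
The flow described in the Figure 1 caption (select a source, obtain knowledge, then answer) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Neighbor`, `knn_vote`, `select_source`, and `answer`, the cosine similarity over plain vectors, and the linear interpolation between the special-token probability and the neighbor vote (the `weight` parameter) are all assumptions made for illustration. The `retrieve`, `verbalize`, and `generate` callables are hypothetical stand-ins for the retriever and the fine-tuned LLM.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

EXTERNAL = "external"  # retrieve knowledge from an external source
INTERNAL = "internal"  # let the LLM verbalize its own parametric knowledge


@dataclass
class Neighbor:
    """A stored query representation and the source that worked best for it."""
    embedding: Sequence[float]
    best_source: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_vote(query_emb: Sequence[float], datastore: Sequence[Neighbor],
             k: int) -> Dict[str, float]:
    """Fraction of the k most similar stored queries that prefer each source."""
    ranked = sorted(datastore, key=lambda n: cosine(query_emb, n.embedding),
                    reverse=True)[:k]
    counts = Counter(n.best_source for n in ranked)
    total = sum(counts.values()) or 1
    return {s: counts.get(s, 0) / total for s in (EXTERNAL, INTERNAL)}


def select_source(token_probs: Dict[str, float], query_emb: Sequence[float],
                  datastore: Sequence[Neighbor], k: int = 3,
                  weight: float = 0.5) -> str:
    """Combine the model's special-token probabilities with the neighbor vote.

    Assumption: a simple linear interpolation; the paper's rule may differ.
    """
    votes = knn_vote(query_emb, datastore, k)
    scores = {s: (1 - weight) * token_probs[s] + weight * votes[s]
              for s in (EXTERNAL, INTERNAL)}
    return max(scores, key=scores.get)


def answer(query: str, token_probs: Dict[str, float],
           query_emb: Sequence[float], datastore: Sequence[Neighbor],
           retrieve: Callable[[str], str], verbalize: Callable[[str], str],
           generate: Callable[[str, str], str]) -> Tuple[str, str]:
    """Route the query, obtain knowledge from the chosen source, then answer."""
    source = select_source(token_probs, query_emb, datastore)
    knowledge = retrieve(query) if source == EXTERNAL else verbalize(query)
    return source, generate(query, knowledge)
```

In this sketch the neighbor vote can override a token-level preference: if the model slightly favors retrieval but the most similar stored queries were all best served by verbalization, the combined score selects the internal source and the retriever is never called.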