RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions

Wanlong Liu, Junying Chen, Ke Ji, Li Zhou, Wenyu Chen, Benyou Wang

TL;DR

RAG-Instruct addresses the limited scenario coverage and limited task diversity of existing retrieval-augmented generation (RAG) methods by synthesizing a broad, high-quality RAG instruction dataset from any corpus. It combines five RAG paradigms with Instruction Simulation to produce diverse instruction-following data, exemplified by a 40K-instruction dataset built from Wikipedia. In zero-shot evaluation across 11 tasks, models trained on RAG-Instruct outperform baselines and approach or exceed the performance of some closed-source LLMs on several benchmarks, especially in multi-hop and domain-specific scenarios. The approach offers a scalable, generalizable way to strengthen RAG capabilities and reduce hallucination through diverse, retrieval-grounded instruction data.

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they cover only a limited range of RAG scenarios, and (2) they suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at https://github.com/FreedomIntelligence/RAG-Instruct.
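
The two ingredients named in the abstract, a RAG paradigm and a simulated instruction, can be pictured as inputs to a single synthesis prompt. The Python sketch below is only an illustration under stated assumptions: the paradigm labels, the template wording, and the names `PARADIGMS`, `TEMPLATE`, `SynthesisRequest`, and `build_request` are all hypothetical and are not taken from the paper or its repository. The actual prompt is the one shown in the paper's Figure 2.

```python
import random
from dataclasses import dataclass

# Hypothetical placeholder labels: the abstract states that there are five
# RAG paradigms but does not name them, so these IDs are illustrative only.
PARADIGMS = ["paradigm_1", "paradigm_2", "paradigm_3", "paradigm_4", "paradigm_5"]

# Hypothetical template wording; it only mirrors the two input variables
# (<document> and <Simulated Instruction>) mentioned in the Figure 2 caption.
TEMPLATE = (
    "<document>\n{documents}\n</document>\n"
    "<Simulated Instruction>\n{exemplar}\n</Simulated Instruction>\n"
    "Query-document relationship: {paradigm}\n"
    "Write a new instruction in the style of the simulated instruction "
    "that fits this relationship to the documents."
)


@dataclass
class SynthesisRequest:
    paradigm: str   # which query-document relationship was sampled
    exemplar: str   # instruction drawn from an existing instruction dataset
    prompt: str     # text that would be sent to the synthesizing LLM


def build_request(documents, exemplar_pool, rng):
    """Sample one paradigm and one exemplar, then fill the prompt template."""
    paradigm = rng.choice(PARADIGMS)
    exemplar = rng.choice(exemplar_pool)
    prompt = TEMPLATE.format(
        documents="\n\n".join(documents),
        exemplar=exemplar,
        paradigm=paradigm,
    )
    return SynthesisRequest(paradigm=paradigm, exemplar=exemplar, prompt=prompt)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs sample the same way
    request = build_request(
        documents=["Passage A from the source corpus.", "Passage B from the source corpus."],
        exemplar_pool=["Summarize the key claim in two sentences."],
        rng=rng,
    )
    print(request.paradigm)
    print(request.prompt)
```

Sampling the paradigm and the exemplar independently is one simple way to vary both the query-document relationship and the instruction style; how the authors actually pair them is not specified in this section.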

Paper Structure

This paper contains 38 sections, 3 equations, 12 figures, and 9 tables.

Figures (12)

  • Figure 1: The main process of synthesizing data with RAG-Instruct. RAG-Instruct ensures instruction data diversity through five RAG paradigms and Instruction Simulation.
  • Figure 2: The prompt of RAG-Instruct. <document> and <Simulated Instruction> represent input variables for the document and simulated instruction, respectively. (Blue text) indicates RAG Paradigms, illustrating the prompt for $r_4$; other paradigms are shown in the appendix. (Red text) represents Instruction Simulation.
  • Figure 3: The distributions of RAG paradigms and simulated instruction sources.
  • Figure 4: Some cases of RAG-Instruct for each RAG scenario. We compare the generated questions with and without using Instruction Simulation.
  • Figure 5: The prompt for filtering knowledge-intensive instructions from synthetic datasets.
  • ...and 7 more figures