Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely

Siyun Zhao; Yuqing Yang; Zilong Wang; Zhiyuan He; Luna K. Qiu; Lili Qiu

TL;DR

The paper addresses how to make large language models (LLMs) use external data effectively. It introduces a four-level taxonomy of user queries (Explicit Facts, Implicit Facts, Interpretable Rationales, and Hidden Rationales) and surveys how retrieval and data-integration techniques address each level. It formalizes the task as a mapping $f: \mathcal{Q} \xrightarrow{\mathcal{D}} \mathcal{A}$ from queries $\mathcal{Q}$ to answers $\mathcal{A}$ given external data $\mathcal{D}$, and defines $\mathrm{Dep}(q)$ as the data segments needed to answer a query $q$, which guides retrieval design. The core contributions include a comprehensive mapping of data-processing, retrieval, and generation enhancements for RAG, plus strategies such as iterative RAG, graph/tree QA, NL2SQL, and rationale-oriented methods (prompt tuning, CoT, offline learning, ICL, and fine-tuning). The survey gives concrete guidance on selecting a data-injection mechanism (context, small model, or fine-tuning) and on building end-to-end pipelines that route each query to an appropriate technique, in support of more reliable, domain-specific LLM applications. These insights aim to help practitioners decompose data requirements, anticipate bottlenecks, and systematically develop data-augmented LLM solutions in sectors such as healthcare, law, finance, and manufacturing.

Abstract

Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges. These challenges encompass a wide range of issues, from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method, classifying user queries into four levels based on the type of external data required and primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and most effective techniques for addressing these challenges. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications.
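The four-level categorization lends itself to a simple routing table. The following Python sketch is illustrative only and is not taken from the paper: the names `QueryLevel`, `TECHNIQUES`, and `route` are hypothetical, the query level is assumed to be already known (classifying a raw query is not shown), and the assignment of techniques to the two rationale levels is an assumption based on the order in which the TL;DR lists them.

```python
from enum import Enum
from typing import Dict, List


class QueryLevel(Enum):
    """The four query levels named in the abstract."""
    EXPLICIT_FACT = 1
    IMPLICIT_FACT = 2
    INTERPRETABLE_RATIONALE = 3
    HIDDEN_RATIONALE = 4


# Hypothetical lookup table: technique names come from the TL;DR,
# but the per-level grouping is an assumption of this sketch.
TECHNIQUES: Dict[QueryLevel, List[str]] = {
    QueryLevel.EXPLICIT_FACT: ["RAG"],
    QueryLevel.IMPLICIT_FACT: ["iterative RAG", "graph/tree QA", "NL2SQL"],
    QueryLevel.INTERPRETABLE_RATIONALE: ["prompt tuning", "CoT"],
    QueryLevel.HIDDEN_RATIONALE: ["offline learning", "ICL", "fine-tuning"],
}


def route(level: QueryLevel) -> List[str]:
    """Return candidate techniques for a query of the given level."""
    # Copy the list so callers cannot mutate the shared table.
    return list(TECHNIQUES[level])


if __name__ == "__main__":
    for level in QueryLevel:
        print(level.name, "->", ", ".join(route(level)))
```

In a real pipeline the level would come from a classifier or from manual triage, and a query that mixes several levels would first need to be decomposed, as the abstract suggests.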

Paper Structure (31 sections, 3 equations, 6 figures, 1 table)

Introduction
Problem Definition
Stratification of Queries
Explicit Fact Queries (L1)
Overview
Data Dependency
Definition
Challenges and Solutions
Retrieval-augmented Generation (RAG)
Data Processing Enhancement
Data Retrieval Enhancement
Response Generation Enhancement
Implicit Fact Queries (L2)
Overview
Challenges and Solutions
...and 16 more sections

Figures (6)

Figure 1: Main Focus of Four Level Queries
Figure 2: Summary of Query Levels in data augmented LLM applications
Figure 3: Three Types of Query-Document Alignment
Figure 4: Demonstration of Rationale Queries
Figure 5: Summary of Main Techniques for Different Query Levels in data augmented LLM applications
...and 1 more figure
