Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study

Shuo Yu; Mingyue Cheng; Qi Liu; Daoyu Wang; Jiqian Yang; Jie Ouyang; Yucong Luo; Chenyi Lei; Enhong Chen

TL;DR

The paper tackles multi-source knowledge integration in retrieval-augmented generation (RAG) by standardizing a benchmark that combines structured API data and unstructured web content. It introduces PruningRAG, a plug-and-play framework that uses coarse- and fine-grained pruning to filter both sources and content, with tailored retrieval for web and API sources and knowledge-enhanced reasoning via CoT and ICL. By organizing inputs so the query follows the retrieved context and evaluating with both exact-match and semantic checks, the approach reduces hallucinations while preserving accuracy. Empirical results show consistent improvements across diverse RAG baselines and model scales, and the authors release the dataset and code to accelerate future research in multi-source RAG.
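The coarse-to-fine idea summarized above can be illustrated with a minimal sketch. It rests on loudly stated assumptions: the keyword-overlap score is a toy stand-in for whatever relevance model is actually used, and the names `overlap_score`, `coarse_prune`, `fine_prune`, and `build_prompt` are hypothetical, not taken from the PruningRAG code. Coarse pruning drops whole sources, fine pruning keeps the best chunks from the surviving sources, and the prompt places the query after the retrieved context.

```python
from typing import Dict, List


def _tokens(text: str) -> set:
    """Lowercased whitespace tokens (toy tokenizer, an assumption)."""
    return set(text.lower().split())


def overlap_score(query: str, text: str) -> float:
    """Fraction of query tokens that appear in the text."""
    q = _tokens(query)
    if not q:
        return 0.0
    return len(q & _tokens(text)) / len(q)


def coarse_prune(query: str, sources: Dict[str, List[str]],
                 min_score: float) -> Dict[str, List[str]]:
    """Coarse step: keep only sources whose best chunk reaches min_score."""
    kept = {}
    for name, chunks in sources.items():
        best = max((overlap_score(query, c) for c in chunks), default=0.0)
        if best >= min_score:
            kept[name] = chunks
    return kept


def fine_prune(query: str, sources: Dict[str, List[str]],
               top_k: int) -> List[str]:
    """Fine step: rank all remaining chunks, keep the top_k with a nonzero score."""
    scored = [(overlap_score(query, c), c)
              for chunks in sources.values() for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort
    return [c for score, c in scored[:top_k] if score > 0]


def build_prompt(query: str, context: List[str]) -> str:
    """Place the query after the retrieved context."""
    ctx = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{ctx}\n\nQuestion: {query}"


if __name__ == "__main__":
    sources = {
        "web": ["the stock price of acme rose today", "weather is sunny"],
        "api": ["unrelated music chart data"],
    }
    query = "acme stock price"
    kept = coarse_prune(query, sources, min_score=0.5)
    print(build_prompt(query, fine_prune(query, kept, top_k=1)))
```

In practice the scoring function would be replaced by a learned retriever or an LLM-based judgment; the two-stage structure is the only point this sketch is meant to convey.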

Abstract

Retrieval-augmented generation (RAG) is increasingly recognized as an effective approach to mitigating the hallucination of large language models (LLMs) through the integration of external knowledge. Despite numerous efforts, most studies focus on a single type of external knowledge source. In real-world applications, however, knowledge typically comes from diverse sources, and this setting remains underexplored. The main obstacle is the lack of a suitable dataset containing multiple knowledge sources, together with the lack of preliminary exploration of the associated issues. To address these challenges, we standardize a benchmark dataset that combines structured and unstructured knowledge across diverse and complementary domains. Based on this dataset, we further develop a plug-and-play RAG framework, PruningRAG, whose main characteristic is the use of multi-granularity pruning strategies to optimize the integration of relevant information while minimizing misleading context. It consistently improves performance across various existing RAG variants, demonstrating its robustness and broad applicability. Building upon the standardized dataset and PruningRAG, we also report a series of experimental results and findings. Our dataset and code are publicly available at https://github.com/USTCAGI/PruningRAG, with the aim of advancing future research in the RAG community.

Paper Structure (35 sections, 5 equations, 6 figures, 8 tables)

Introduction
Related Work
Retrieval-Augmented Generation
Existing Benchmarks for RAG
Preliminaries
Problem Formulation
A Multi-Source Knowledge RAG Dataset
Dataset Standardization
Dataset Overview
Methodology
Overview of the PruningRAG Framework
Multi-Source Knowledge Pruning
Coarse-Grained Knowledge Pruning
Fine-Grained Knowledge Pruning
Knowledge-Enhanced Reasoning
...and 20 more sections

Figures (6)

Figure 1: Comparison between Standard RAG and PruningRAG. Standard RAG typically relies on a single knowledge source for retrieval and generation. In contrast, PruningRAG enhances the utilization of multiple external knowledge sources by applying multi-granularity pruning strategies.
Figure 2: An illustration of PruningRAG, including multi-source knowledge pruning, knowledge reasoning, and evaluation. Knowledge pruning filters out irrelevant knowledge sources and improves context relevance. The pruned knowledge is combined with the query for reasoning, and the answer is evaluated on accuracy and hallucination.
Figure 3: Prompt design template incorporating CoT and ICL for enhanced reasoning.
Figure 4: Comparative analysis of retrieval methods and prompt strategies. (a) Performance comparison of sparse, dense, and hybrid retrieval. Hybrid retrieval combines top-ranked chunks from both sparse and dense retrieval, which are run in parallel, with selection based on weighted proportion. (b) Comparison of Chain-of-Thought (CoT) prompting across different knowledge sources (web pages and API). (c) Evaluation of three query positioning strategies within the prompt: Query-Prepend (placing the query before the retrieved context), Query-Append (placing the query after the context), and Query-Surround (placing the query both before and after the context).
Figure 5: Sensitivity analysis of PruningRAG with respect to key retrieval configuration parameters. (a) Effect of varying chunk size. (b) Joint impact of chunk size and chunk overlap. (c) Influence of the number of retrieved chunks.
...and 1 more figure
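The retrieval settings named in the captions of Figures 4 and 5 (chunk size, chunk overlap, hybrid selection by weighted proportion, and query positioning) can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' implementation: the function names `chunk_words`, `hybrid_merge`, and `place_query`, the `dense_weight` parameter, and the exact merge rule are all hypothetical.

```python
from typing import List


def chunk_words(words: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split a word list into windows of `size` that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [words[i:i + size] for i in range(0, stop, step)]


def hybrid_merge(sparse: List[str], dense: List[str],
                 top_k: int, dense_weight: float = 0.5) -> List[str]:
    """Take a weighted share of top dense results, fill the rest from sparse.

    Assumes each ranked list is already sorted and free of duplicates.
    """
    n_dense = int(top_k * dense_weight + 0.5)  # round half up
    merged = list(dense[:n_dense])
    for chunk in sparse:
        if len(merged) >= top_k:
            break
        if chunk not in merged:
            merged.append(chunk)
    return merged


def place_query(query: str, context: str, mode: str = "append") -> str:
    """Build a prompt with the query before, after, or around the context."""
    q = f"Question: {query}"
    if mode == "prepend":
        parts = [q, context]
    elif mode == "append":
        parts = [context, q]
    elif mode == "surround":
        parts = [q, context, q]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    words = "a b c d e f g".split()
    print(chunk_words(words, size=4, overlap=2))
    print(hybrid_merge(["a", "b", "c", "d"], ["c", "e", "a", "f"], top_k=4))
    print(place_query("q?", "ctx", mode="surround"))
```

Here `mode="append"` corresponds to Query-Append, `"prepend"` to Query-Prepend, and `"surround"` to Query-Surround as described in the Figure 4(c) caption.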
