FunnelRAG: A Coarse-to-Fine Progressive Retrieval Paradigm for RAG

Xinping Zhao, Yan Zhong, Zetian Sun, Xinshuo Hu, Zhenyu Liu, Dongfang Li, Baotian Hu, Min Zhang

TL;DR

FunnelRAG addresses the inefficiency and performance ceiling of flat retrieval in Retrieval-Augmented Generation by introducing a coarse-to-fine progressive retrieval pipeline that combines large-to-small candidate sets, coarse-to-fine granularity, and low-to-high-capacity retrievers. The method has three stages: retrieval of long coarse units, pre-ranking of documents within clusters, and post-ranking of fine-grained passages. L2G distillation is added to align signals across stages. Empirical results on Natural Questions and TriviaQA show about a 40% reduction in retrieval time with comparable or improved answer recall, and generation benefits in most settings, especially at tighter cutoffs. The work shows that chaining simple-to-complex retrievers along a progressive pipeline yields substantial efficiency gains while preserving retrieval quality and contextual integrity.

Abstract

Retrieval-Augmented Generation (RAG) prevails in Large Language Models. It mainly consists of retrieval and generation. The retrieval modules (a.k.a. retrievers) aim to find useful information used to facilitate the generation modules (a.k.a. generators). As such, generators' performance largely depends on the effectiveness and efficiency of retrievers. However, the widely used retrieval paradigm remains flat. It treats retrieval procedures as a one-off deal with constant granularity. Despite effectiveness, we argue that they suffer from two limitations: (1) flat retrieval exerts a significant burden on one retriever; (2) constant granularity limits the ceiling of retrieval performance. In this work, we propose a progressive retrieval paradigm with coarse-to-fine granularity for RAG, termed FunnelRAG, so as to balance effectiveness and efficiency. Specifically, FunnelRAG establishes a progressive retrieval pipeline by collaborating coarse-to-fine granularity, large-to-small quantity, and low-to-high capacity, which can relieve the burden on one retriever and also promote the ceiling of retrieval performance. Extensive experiments manifest that FunnelRAG achieves comparable retrieval performance while the time overhead is reduced by nearly 40 percent.
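The progressive pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`funnel_retrieve`, `overlap_score`, `density_score`), the cutoffs, and the toy scorers are all assumptions made for illustration. The two scorers are simple stand-ins for a low-capacity and a higher-capacity retriever, and the fixed-length word split stands in for the segmentation operation. The sketch only shows the funnel shape: many coarse units are scored cheaply, fewer documents are re-scored, and a handful of fine-grained passages are ranked last.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that strips basic punctuation."""
    return [t.strip(".,?!").lower() for t in text.split() if t.strip(".,?!")]


def overlap_score(query, text):
    """Cheap stand-in scorer: count occurrences of query tokens in the text."""
    query_tokens = set(tokenize(query))
    counts = Counter(tokenize(text))
    return sum(counts[t] for t in query_tokens)


def density_score(query, text):
    """Stand-in for a costlier scorer: overlap normalised by text length."""
    return overlap_score(query, text) / max(len(tokenize(text)), 1)


def top_k(items, score_fn, k):
    """Return the k highest-scoring items (ties keep their original order)."""
    return sorted(items, key=score_fn, reverse=True)[:k]


def funnel_retrieve(query, clusters, k_clusters=2, k_docs=2,
                    k_passages=2, passage_len=8):
    """Hypothetical three-stage funnel over clusters (lists of documents)."""
    # Stage 1 (retrieval): score long coarse units with the cheap scorer.
    coarse = top_k(clusters,
                   lambda c: overlap_score(query, " ".join(c)),
                   k_clusters)

    # Stage 2 (pre-ranking): rank documents inside the surviving clusters.
    docs = [doc for cluster in coarse for doc in cluster]
    docs = top_k(docs, lambda d: density_score(query, d), k_docs)

    # Stage 3 (post-ranking): segment documents into short passages
    # and rank only those few fine-grained candidates.
    passages = []
    for doc in docs:
        words = doc.split()
        for i in range(0, len(words), passage_len):
            passages.append(" ".join(words[i:i + passage_len]))
    return top_k(passages, lambda p: density_score(query, p), k_passages)


if __name__ == "__main__":
    toy_clusters = [
        ["Paris is the capital of France.",
         "France is in Europe and borders Spain."],
        ["Bananas are yellow fruit.", "Apples grow on trees."],
        ["Berlin is the capital of Germany.", "Germany borders France."],
    ]
    for passage in funnel_retrieve("capital of France", toy_clusters):
        print(passage)
```

Each stage sees fewer candidates than the one before it, so the more expensive scorer only runs on a small set. In a real system the stand-in scorers would be replaced by actual retrievers of increasing capacity.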

Paper Structure

This paper contains 34 sections, 10 equations, 6 figures, 13 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison between (a) the flat retrieval and (b) the progressive retrieval paradigm, where is the segmentation operation. FunnelRAG performs progressive retrieval from large to small quantity, from coarse to fine granularity, and with simple to complex retrievers, which balances effectiveness and efficiency.
  • Figure 2: Answer Recall (AR) w.r.t. coarse- and fine-grained retrieval. The bars denote AR, while the lines denote the percentage of performance degradation compared to the cutoff position of 100%. The X-axis represents the percentage of units retrieved. Under the same percentile, the number of tokens retrieved by 'Fine' and 'Coarse' is equal.
  • Figure 3: Answer Recall w.r.t. high- and low-capacity retrievers. The line denotes the percentage of performance improvement compared to the low-capacity retriever.
  • Figure 4: The overall system framework of FunnelRAG. The upper layer illustrates the working flow of the flat retrieval paradigm, while the bottom layer illustrates the working flow of our progressive retrieval paradigm.
  • Figure 5: Model performance w.r.t. (a) different granularity of clustered documents, (b) different numbers of representative tokens, and (c) L2G distillation. "#Rep tokens" abbreviates "the number of representative tokens".
  • ...and 1 more figure