Provence: efficient and robust context pruning for retrieval-augmented generation

Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, Stéphane Clinchant

TL;DR

The paper tackles the inefficiency and noise issues in retrieval-augmented generation by introducing Provence, a robust sentence-level context pruner trained as binary sequence labeling on a cross-encoder backbone. Provence can operate as a standalone pruner or be unified with reranking to make pruning effectively cost-free in the RAG pipeline, using a threshold $T$ to control pruning and sentence-rounding to preserve coherence. Trained on diverse data from MS MARCO and Natural Questions, and evaluated across multiple QA domains, Provence achieves a favorable Pareto front—maintaining QA performance with substantial context compression—and demonstrates robustness to context length and sentence ordering. The work also provides extensive ablations and analyses to guide future context-pruner design, with practical impact in making RAG systems faster and more reliable across domains, while noting limitations to English and single-passage QA.
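As a rough illustration of how a threshold $T$ and sentence rounding might interact, the sketch below assumes the pruner emits one relevance probability per token, binarizes each at $T$, and keeps a sentence when more than half of its tokens are labeled relevant. The majority-vote rule, the function names, and the compression-ratio definition (fraction of characters removed) are assumptions made for illustration; this summary does not specify them.

```python
from typing import List, Sequence


def round_to_sentences(
    token_probs_per_sentence: Sequence[Sequence[float]],
    threshold: float = 0.1,
    min_fraction: float = 0.5,
) -> List[bool]:
    """Return one keep/drop decision per sentence.

    token_probs_per_sentence[k] holds hypothetical per-token relevance
    probabilities for sentence k. A token counts as relevant when its
    probability exceeds `threshold`; a sentence is kept when the share of
    relevant tokens exceeds `min_fraction` (assumed majority-vote rounding).
    """
    keep = []
    for token_probs in token_probs_per_sentence:
        if not token_probs:
            keep.append(False)  # an empty sentence carries no evidence
            continue
        relevant = sum(p > threshold for p in token_probs)
        keep.append(relevant / len(token_probs) > min_fraction)
    return keep


def compression_ratio(sentences: Sequence[str], keep: Sequence[bool]) -> float:
    """Fraction of characters removed (assumed definition; higher = more pruning)."""
    original = sum(len(s) for s in sentences)
    kept = sum(len(s) for s, k in zip(sentences, keep) if k)
    return 1.0 - kept / original if original else 0.0


# Toy example: three sentences with made-up token probabilities.
probs = [[0.9, 0.8, 0.05], [0.01, 0.02], [0.2, 0.05, 0.3, 0.04]]
texts = ["abcd", "ef", "ghij"]
keep = round_to_sentences(probs, threshold=0.1)
pruned = " ".join(s for s, k in zip(texts, keep) if k)
```

Because whole sentences are kept or dropped, the pruned context never contains sentence fragments; under these assumptions, raising `threshold` can only shrink the kept set.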

Abstract

Retrieval-augmented generation improves various aspects of large language model (LLM) generation, but suffers from computational overhead caused by long contexts as well as the propagation of irrelevant retrieved information into generated responses. Context pruning deals with both aspects, by removing irrelevant parts of retrieved contexts before LLM generation. Existing context pruning approaches are, however, limited, and do not provide a universal model that would be both efficient and robust in a wide range of scenarios, e.g., when contexts contain a variable amount of relevant information or vary in length, or when evaluated on various domains. In this work, we close this gap and introduce Provence (Pruning and Reranking Of retrieVEd relevaNt ContExts), an efficient and robust context pruner for Question Answering, which dynamically detects the needed amount of pruning for a given context and can be used out-of-the-box for various domains. The three key ingredients of Provence are formulating the context pruning task as sequence labeling, unifying context pruning capabilities with context reranking, and training on diverse data. Our experimental results show that Provence enables context pruning with negligible to no drop in performance, in various domains and settings, at almost no cost in a standard RAG pipeline. We also conduct a deeper analysis alongside various ablations to provide insights into training context pruners for future work.

Paper Structure

This paper contains 16 sections, 1 equation, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Illustration of inference (left) and training (right) of Provence.
  • Figure 2: Main results for various QA domains, comparing Provence and baseline models. Generator: LLaMA-2-7B, retriever: SPLADE-v3, reranker: DeBERTa-v3 (or Provence in the unified setting). Plot titles denote "Dataset name (datastore type)". $x$-axis denotes QA performance evaluated with LLM-as-a-judge; $y$-axis denotes the context compression ratio. For both metrics, the higher the better: the best model would be closest to the top right corner. Numerical scores are presented in App. Tables \ref{tab:mainnum1}--\ref{tab:mainnum2}. Main conclusion: Provence consistently lies on the Pareto front.
  • Figure 3: Analyses. (Left) Needle-in-the-haystack test, which controls the position of the ground truth sentence(s) in the context. (Middle) Comparison of the number of sentences selected by the silver predictor (LLaMA-3-8B-Instruct) and by Provence. Heatmaps are normalized by rows: the cell in position $(i, j)$ gives the percentage of contexts pruned to $i$ sentences by the silver predictor that Provence pruned to $j$ sentences. (Right) Testing Provence in settings with different context lengths. All experiments are done with unified Provence, $T=0.1$.
  • Figure 4: Ablation results. All models are single-component modifications of the anchor model, which is a base-size model, trained on NQ data, with the answer oracle and token-level labeling. Numeric scores for this figure are duplicated in Appendix Table \ref{tab:ablationsnum}, and results with match-based metrics are presented in Appendix Figure \ref{fig:ablation_match}.
  • Figure 5: Statistics of the silver contexts labeled by LLaMA-3-8B-Instruct. (Left) the distribution of the number of sentences in silver contexts. (Right) the distribution of the position of the selected sentences in contexts.
  • ...and 6 more figures
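
The row normalization described for the Figure 3 heatmap can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function name, the input format (one sentence count per context from each pruner), and the clamping of counts above `max_sentences` into the last bucket are all assumptions.

```python
from typing import List, Sequence


def row_normalized_heatmap(
    silver_counts: Sequence[int],
    provence_counts: Sequence[int],
    max_sentences: int,
) -> List[List[float]]:
    """Cell (i, j): percentage of contexts pruned to i sentences by the
    silver predictor that were pruned to j sentences by Provence."""
    size = max_sentences + 1
    counts = [[0] * size for _ in range(size)]
    for i, j in zip(silver_counts, provence_counts):
        # Assumed convention: counts above max_sentences share the last bucket.
        counts[min(i, max_sentences)][min(j, max_sentences)] += 1

    heatmap = []
    for row in counts:
        total = sum(row)
        # Each non-empty row sums to 100; empty rows stay at zero.
        heatmap.append([100.0 * c / total if total else 0.0 for c in row])
    return heatmap


# Toy example with made-up counts for six contexts.
silver = [1, 1, 2, 2, 2, 0]
provence = [1, 2, 2, 2, 1, 0]
heatmap = row_normalized_heatmap(silver, provence, max_sentences=2)
```

With row normalization, a strong diagonal means the two pruners tend to keep the same number of sentences, regardless of how many contexts fall into each row.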