Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking

Will LeVine, Bijan Varjavand

TL;DR

REBEL introduces a multi-criteria reranking framework for retrieval-augmented generation that uses Chain-of-Thought prompting to optimize not only relevance but also secondary properties such as depth, diversity, clarity, authoritativeness, and recency. It proposes two strategies: a fixed one-turn approach and a dynamic two-turn approach that infers query-specific criteria, yielding improved end-to-end performance measured by answer similarity and retrieval precision. Through a carefully controlled experimental setup with a large arXiv-derived dataset and multiple LLMs, the authors show that incorporating secondary criteria can overcome the traditional relevance-only information bottleneck and enable a favorable compute/quality tradeoff curve. The work also discusses safety considerations and potential extensions, and provides public code resources to reproduce and build upon the results.

Abstract

Modern Large Language Model (LLM) systems typically rely on Retrieval Augmented Generation (RAG), which aims to gather context that is useful for response generation. These RAG systems typically optimize strictly towards retrieving context that is maximally relevant to the query. However, conventional theory suggests that retrieval systems which seek to maximize context relevance without any additional explicit criteria can create information bottlenecks. We reaffirm this finding in the modern age of LLMs by showing that in standard RAG pipelines, maximizing for context relevance alone can degrade downstream response quality. In response, we show evaluations of existing RAG methods which account for both context relevance and answer quality. These evaluations introduce a novel finding that existing RAG systems scale poorly with inference-time compute usage when considering our combined metric. We introduce "RErank BEyond reLevance (REBEL)", which enables RAG systems to scale with inference-time compute via injection of multi-criteria optimization using Chain-of-Thought prompting (and optionally Multi-Turn dialogue). Ultimately, this enables a new performance/speed tradeoff curve, where RAG systems are able to achieve both higher relevance of retrieved contexts and superior answer quality as inference time increases. Code for the implementation of our method in llama-index can be found at the following PR: https://github.com/run-llama/llama_index/pull/17590. Code for running experiments using this llama-index implementation can be found at https://github.com/microsoft/REBEL.
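To make the one-turn idea concrete, here is a minimal sketch of how a fixed multi-criteria reranking prompt might be assembled. It is not the prompt from the paper or from the llama-index PR: the function name `build_rerank_prompt`, the wording of the instructions, and the requested output format are all assumptions made for illustration. Only the five criteria names come from the summary above.

```python
# Illustrative sketch only: the template wording and output format are
# assumptions, not the prompt used by REBEL or by llama-index.

# The five fixed secondary criteria named for the one-turn variant.
SECONDARY_CRITERIA = ["depth", "diversity", "clarity", "authoritativeness", "recency"]


def build_rerank_prompt(query: str, documents: list[str],
                        criteria: list[str] = SECONDARY_CRITERIA) -> str:
    """Assemble a single-turn prompt asking an LLM to score documents
    on relevance first and then on a fixed list of secondary criteria."""
    criteria_lines = "\n".join(f"- {name}" for name in criteria)
    doc_lines = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Think step by step before scoring.\n"
        f"Query: {query}\n\n"
        "First judge each document's relevance to the query, "
        "then judge it on these secondary criteria:\n"
        f"{criteria_lines}\n\n"
        f"Documents:\n{doc_lines}\n\n"
        "Return one line per document in the form "
        "'Doc: <number>, Score: <0-10>'."
    )


if __name__ == "__main__":
    print(build_rerank_prompt(
        "What limits relevance-only retrieval?",
        ["A survey of retrieval bottlenecks.", "A blog post about prompts."],
    ))
```

A two-turn variant would, under the same assumptions, first ask the model to propose query-specific criteria and then pass that list as the `criteria` argument instead of the fixed default.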

Paper Structure

This paper contains 48 sections, 6 figures.

Figures (6)

  • Figure 1: (Left) Comparison of retrieval methods showing retrieval precision versus answer similarity, with error bars indicating 95% confidence intervals. The dashed best-fit lines represent the previously posited information bottleneck (blue) and the surpassing of that bottleneck by our multi-criteria rerankers (red). The one-turn version uses five fixed criteria (depth, diversity, clarity, authoritativeness, and recency) to achieve both higher retrieval relevance and answer quality than vanilla RAG (No Rerank). The two-turn version further improves performance by adapting criteria to each query through a two-turn prompting process. (Right) Visualization of system quality (measured as the product of answer similarity and retrieval precision) and system inference speed (measured by generated output characters per second) for each method. We note that existing relevance-only methods are not able to achieve higher system quality at efficient inference speeds, while our multi-criteria methods enable a new RAG tradeoff curve where inference compute can be leveraged to greatly increase system quality.
  • Figure 2: The two-turn version of REBEL Rerank enhances RAG systems by generating query-dependent reranking prompts that guide document selection based on both relevance and secondary criteria (such as authoritativeness, diversity, and recency) inferred from the user query. The Reranking Prompt Generator creates custom prompts that help the Reranker evaluate retrieved documents using a comprehensive scoring system that extends beyond simple relevance matching. Our experiments show that this approach maintains high retrieval relevance while significantly improving end-to-end answer quality, challenging the conventional assumption that maximizing relevance alone leads to optimal results. This finding suggests that the quality of RAG-generated responses depends not just on the topical relevance of retrieved documents, but on a broader set of contextual criteria that vary by query type and domain.
  • Figure 3: An overview of a Retrieval Augmented Generation (RAG) pipeline, including the usages of user queries in retrieving documents and documents in response generation. Inspired by eibich2024aragog.
  • Figure 4: An overview of reranking within a RAG system. This shows how a set of $k$ retrieved documents is further refined to a more curated set of $n$ documents, which are then used for generation. Inspired by eibich2024aragog.
  • Figure 5: Detailed view of the calculation of retrieval precision. Inspired by eibich2024aragog.
  • ...and 1 more figure
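
The quantities named in the Figure 1 caption (system quality as the product of answer similarity and retrieval precision, and speed as generated output characters per second) and the $k$-to-$n$ refinement in the Figure 4 caption can be sketched in a few lines. The function names and the example values below are hypothetical, and the per-document scores are assumed to come from some reranker. Nothing here reproduces the paper's evaluation code.

```python
# Illustrative sketch only: names and example values are hypothetical.

def system_quality(answer_similarity: float, retrieval_precision: float) -> float:
    """Combined metric from the Figure 1 caption: the product of the two scores."""
    return answer_similarity * retrieval_precision


def inference_speed(num_output_chars: int, elapsed_seconds: float) -> float:
    """Generated output characters per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_output_chars / elapsed_seconds


def rerank_top_n(documents: list[str], scores: list[float], n: int) -> list[str]:
    """Refine k retrieved documents to the n highest-scoring ones.

    `scores` are assumed to be per-document scores produced by a reranker.
    """
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:n]]


if __name__ == "__main__":
    print(system_quality(0.5, 0.8))
    print(inference_speed(1200, 4.0))
    print(rerank_top_n(["a", "b", "c"], [0.2, 0.9, 0.5], n=2))
```

Because the combined metric is a product, a method that raises retrieval precision while lowering answer similarity can end up with a lower system quality than it started with.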