Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking

Will LeVine; Bijan Varjavand

TL;DR

REBEL introduces a multi-criteria reranking framework for retrieval-augmented generation that uses Chain-of-Thought prompting to optimize not only relevance but also secondary properties such as depth, diversity, clarity, authoritativeness, and recency. It proposes two strategies: a fixed one-turn approach and a dynamic two-turn approach that infers query-specific criteria. Both yield improved end-to-end performance as measured by answer similarity and retrieval precision. Through a carefully controlled experimental setup with a large arXiv-derived dataset and multiple LLMs, the authors show that incorporating secondary criteria can overcome the traditional relevance-only information bottleneck and enable a favorable compute/quality tradeoff curve. The work also discusses safety considerations and potential extensions, and provides public code resources for reproducing and building on the results.

Abstract

Modern Large Language Model (LLM) systems typically rely on Retrieval-Augmented Generation (RAG), which aims to gather context that is useful for response generation. These RAG systems typically optimize strictly towards retrieving context that is maximally relevant to the query. However, conventional theory suggests that retrieval systems which seek to maximize context relevance without any additional explicit criteria can create information bottlenecks. We reaffirm this finding in the modern age of LLMs by showing that, in standard RAG pipelines, maximizing for context relevance alone can degrade downstream response quality. In response, we evaluate existing RAG methods with a combined metric that accounts for both context relevance and answer quality. These evaluations yield a novel finding: existing RAG systems scale poorly with inference-time compute usage under our combined metric. We introduce "RErank BEyond reLevance (REBEL)", which enables RAG systems to scale with inference-time compute via injection of multi-criteria optimization using Chain-of-Thought prompting (and optionally Multi-Turn dialogue). Ultimately, this enables a new performance/speed tradeoff curve, where RAG systems are able to achieve both higher relevance of retrieved contexts and superior answer quality as inference time increases. Code for the implementation of our method in llama-index can be found at the following PR: https://github.com/run-llama/llama_index/pull/17590. Code for running experiments using this llama-index implementation can be found at https://github.com/microsoft/REBEL.
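To make the idea of reranking beyond relevance concrete, the following is a minimal sketch and not the authors' implementation. REBEL obtains its judgments through Chain-of-Thought prompting of an LLM; here the per-criterion scores are assumed to be already available as numbers, and the names `ScoredDoc` and `rerank`, the 0-10 score scale, and the weights are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredDoc:
    """A retrieved passage with per-criterion scores (hypothetical structure).

    Scores are assumed to be on a 0-10 scale and produced elsewhere,
    e.g. by an LLM judge; this sketch does not call any model.
    """
    text: str
    relevance: float
    secondary: dict[str, float] = field(default_factory=dict)


def rerank(docs, weights, relevance_weight=1.0, top_k=None):
    """Order docs by relevance plus a weighted sum of secondary criteria.

    `weights` maps a criterion name (e.g. "depth") to its weight.
    Criteria missing from a document contribute 0.
    """
    def combined(doc):
        score = relevance_weight * doc.relevance
        for name, weight in weights.items():
            score += weight * doc.secondary.get(name, 0.0)
        return score

    ranked = sorted(docs, key=combined, reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Illustrative, made-up scores for three passages.
docs = [
    ScoredDoc("A", relevance=9.0, secondary={"depth": 2.0, "recency": 1.0}),
    ScoredDoc("B", relevance=8.0, secondary={"depth": 8.0, "recency": 7.0}),
    ScoredDoc("C", relevance=5.0, secondary={"depth": 9.0, "recency": 9.0}),
]

# With no secondary criteria, the ordering is by relevance alone.
relevance_only = [d.text for d in rerank(docs, weights={})]

# With secondary criteria, a slightly less relevant but deeper and more
# recent passage can move ahead of the most relevant one.
multi = [d.text for d in rerank(docs, weights={"depth": 0.3, "recency": 0.2})]
```

In this toy setup, fixed weights loosely mirror the one-turn strategy with a predetermined set of criteria; under the same assumptions, the two-turn strategy would correspond to choosing the criteria and their weights per query before scoring.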
