Inference Scaling for Bridging Retrieval and Augmented Generation

Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He

TL;DR

This work tackles generator position bias in retrieval-augmented generation (RAG) by introducing Mixture-of-Intervention (MoI), an inference-time method that treats retrieved passages as interventions and decomposes their true utility $u_p$ from position bias $a_j$ using multiple parallel permutations. MoI employs a strategized propose-aggregate framework to efficiently estimate $u_p$ and $a_j$, enabling debiased ranking without training a separate bridge module. The approach leverages retriever priors to prune the search space and can use smaller agents via preference distillation to reduce computational cost. Across MS MARCO, HotpotQA, CRAG, and FEVER, MoI yields significant improvements in downstream metrics (e.g., ROUGE-L and EM) and demonstrates cost savings and robustness across model scales, highlighting its practical impact for improving RAG systems without retraining components.

Abstract

Retrieval-augmented generation (RAG) has emerged as a popular approach to steering the output of a large language model (LLM) by incorporating retrieved contexts as inputs. However, existing work has observed a generator bias, such that improving the retrieval results may negatively affect the outcome. In this work, we show that such bias can be mitigated through inference scaling, by aggregating inference calls over permuted orders of the retrieved contexts. The proposed Mixture-of-Intervention (MoI) explicitly models the debiased utility of each passage with multiple forward passes to construct a new ranking. We also show that MoI can leverage the retriever's prior knowledge to reduce the computational cost by minimizing the number of permutations considered and lowering the cost per LLM call. We showcase the effectiveness of MoI on diverse RAG tasks, improving ROUGE-L on MS MARCO and EM on HotpotQA by ~7 points.
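
As a rough illustration of how utility and position bias can be separated from permuted orderings, the sketch below assumes a simple additive model in which the score observed for passage $p$ at position $j$ is $u_p + a_j$. The additive form, the `score` callback, and the toy numbers are assumptions made for this example and are not taken from the paper. It uses cyclic permutations so that every passage is observed once at every position, and averages rows and columns of the resulting score table.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def cyclic_permutations(passages: Sequence[str]) -> List[List[str]]:
    """Return the n rotations of the input order (one per starting offset)."""
    n = len(passages)
    return [list(passages[k:]) + list(passages[:k]) for k in range(n)]


def estimate_utility_and_bias(
    passages: Sequence[str],
    score: Callable[[List[str], int], float],
) -> Tuple[Dict[str, float], List[float]]:
    """Estimate per-passage utility and per-position bias.

    `score(perm, j)` is a hypothetical callback returning the observed
    contribution of the passage at position j of ordering `perm`.
    Assumes distinct passages and an additive model: obs = u_p + a_j.
    """
    n = len(passages)
    obs = {p: [0.0] * n for p in passages}
    # Each rotation places every passage at a different position.
    for perm in cyclic_permutations(passages):
        for j, p in enumerate(perm):
            obs[p][j] = score(perm, j)

    grand_mean = sum(sum(row) for row in obs.values()) / (n * n)
    # Row means: utility, shifted by the same constant for all passages.
    utility = {p: sum(row) / n for p, row in obs.items()}
    # Column means minus the grand mean: position bias, centred at zero.
    bias = [sum(obs[p][j] for p in passages) / n - grand_mean for j in range(n)]
    return utility, bias


# Toy data (illustrative only): a hidden utility per passage and a
# hidden bias per position, combined additively by the scorer.
TRUE_UTILITY = {"A": 0.2, "B": 0.9, "C": 0.5}
TRUE_BIAS = [0.3, 0.0, 0.1]


def toy_score(perm: List[str], j: int) -> float:
    return TRUE_UTILITY[perm[j]] + TRUE_BIAS[j]


utility, bias = estimate_utility_and_bias(["A", "B", "C"], toy_score)
ranking = sorted(utility, key=utility.get, reverse=True)
print(ranking)
```

On the toy data the recovered ranking is `['B', 'C', 'A']`, matching the hidden utilities even though the original order placed `A` at the most favoured position. The utilities are only identified up to a shared constant under this model, which does not affect the ranking.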

Paper Structure

This paper contains 36 sections, 12 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: (Left, RAG) Top-10 passages retrieved by a complex retrieval system involving the Bing search engine are fed to the generator. (Center) RankGPT, a strong LLM-based reranker, hurts performance, and more severely with a stronger backbone. (Right) MoI improves the answer quality, outperforming RAG without reranking by a large margin of 6 points in accuracy.
  • Figure 2: (A, baseline) Self-consistency [selfconsistency] and MoA [wang-etal-2024-mixture] treat random permutations of passages as black boxes and count consistency votes over outcomes. (B, proposed) In MoI, permutations are treated as white-box interventions on one another, such that, from the observations of $p$ in varying positions, MoI estimates the effect of each passage on generation $u$ along with the impact of position bias $a$. Finally, the ordering based on the debiased utility $u$ is used for generation.
  • Figure 3: Ideally, wherever a passage $p$ is placed, its contribution to generation, or utility, should be constant (blue line). However, due to the position bias of LLMs, the observed utility (orange curve) varies with the position and surrounding context. MoI disentangles the effect of position bias (left figure) from the observations to determine the debiased utility $u_p$ through multiple parallel interventions.
  • Figure 4: To approximate a comprehensive subset, we consider the set of cyclic permutations as $S$, encompassing diverse yet representative permutations to allow desirable ones to be surfaced.
  • Figure 5: The distribution of $s_i$ from an LLM is distilled to a smaller model by minimizing the KL divergence between the normalized probability distributions after softmax. Values colored orange can be pre-computed.
  • ...and 2 more figures
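
The distillation objective described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, assuming the teacher (LLM) and student (smaller model) each produce one raw score per passage and that the loss is KL(teacher || student) over the softmax-normalized scores. The direction of the KL term, the function names, and the example scores are assumptions, not details confirmed by the source.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(
    teacher_scores: Sequence[float], student_scores: Sequence[float]
) -> float:
    """KL between softmax-normalized teacher and student score distributions."""
    return kl_divergence(softmax(teacher_scores), softmax(student_scores))


# Illustrative scores for three passages (made-up values).
teacher = [2.0, 0.5, -1.0]
student_close = [1.9, 0.6, -1.1]
student_far = [-1.0, 0.5, 2.0]

loss_close = distillation_loss(teacher, student_close)
loss_far = distillation_loss(teacher, student_far)
print(loss_close < loss_far)
```

Because the teacher distribution does not depend on the student, `softmax(teacher_scores)` can be computed once and cached, which is one plausible reading of the caption's note that some values can be pre-computed.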