FB-RAG: Improving RAG with Forward and Backward Lookup

Kushal Chawla, Alfy Samuel, Anoop Kumar, Daben Liu

TL;DR

FB-RAG addresses a bottleneck of traditional RAG: queries that lack strong signals for retrieving the most relevant context. It adds a forward-looking step that guides retrieval before final generation. A forward component, derived from multiple sampled outputs of a lightweight LLM, is combined with a backward signal from the query to compute a forward-backward score S_FB for each context chunk. This score drives chunk selection in a three-stage pipeline: recall-focused retrieval, precision-focused retrieval, and generation. The framework is training-free, relies on off-the-shelf retrievers, and shows consistent gains across 9 datasets from LongBench and ∞Bench, including substantial latency reductions on EN.QA. Key findings are that forward signals can improve chunk ranking even when the small LLM fails on some samples, and that pairing a lighter forward-lookup model with a stronger final generator can yield better performance-latency tradeoffs. The results position FB-RAG as a practical enhancement for long-context QA, with guidance on when to prefer forward-only over forward+backward scoring and on how to balance latency against accuracy in deployment.
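
As a rough illustration of the forward-backward scoring described above, the Python sketch below ranks chunks by a weighted sum of a backward score (chunk vs. query) and a forward score (chunk vs. sampled outputs from the lightweight LLM). It is a sketch under assumptions, not the authors' implementation: `overlap_score` is a toy token-overlap stand-in for an off-the-shelf retriever, taking the maximum over samples is only one plausible way to aggregate the forward evidence, and the names `fb_score`, `rank_chunks`, `eta_f`, and `eta_b` are hypothetical (the weights mirror the $\eta_F$ and $\eta_B$ settings mentioned in the figure captions).

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(chunk, text):
    """Toy relevance score (assumption): fraction of tokens in `text`
    that also appear in `chunk`. A real system would call a retriever."""
    text_tokens = tokenize(text)
    if not text_tokens:
        return 0.0
    chunk_tokens = set(tokenize(chunk))
    return sum(t in chunk_tokens for t in text_tokens) / len(text_tokens)


def fb_score(chunk, query, sampled_outputs, eta_f=1.0, eta_b=0.0):
    """Weighted forward-backward score for one chunk.

    backward: chunk scored against the input query.
    forward:  best score of the chunk against any sampled output
              (max aggregation is an assumption of this sketch).
    """
    backward = overlap_score(chunk, query)
    forward = max((overlap_score(chunk, out) for out in sampled_outputs),
                  default=0.0)
    return eta_b * backward + eta_f * forward


def rank_chunks(chunks, query, sampled_outputs, k, eta_f=1.0, eta_b=0.0):
    """Return indices of the top-k chunks, highest score first.
    Ties are broken by original position."""
    scored = [(fb_score(c, query, sampled_outputs, eta_f, eta_b), i)
              for i, c in enumerate(chunks)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]
```

With `eta_f=1.0` and `eta_b=0.0` the ranking depends only on the sampled outputs; a positive `eta_b` mixes in the query signal. The selected chunks would then form the shorter prompt passed to the final, more powerful generator.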

Abstract

Traditional Retrieval-Augmented Generation (RAG) struggles with complex queries that lack strong signals to retrieve the most relevant context, forcing a trade-off between choosing a small context that misses key information and a large context that confuses the LLM. To address this, we propose Forward-Backward RAG (FB-RAG), a new training-free framework based on a simple yet powerful forward-looking strategy. FB-RAG employs a lightweight LLM to peek into potential future generations, using evidence from multiple sampled outputs to precisely identify the most relevant context for a final, more powerful generator. This improves performance without the complex fine-tuning or reinforcement learning common in prior work. Across 9 datasets from LongBench and ∞Bench, FB-RAG consistently delivers strong results. Further, the performance gains can be achieved with reduced latency due to a shorter, more focused prompt for the powerful generator. On the EN.QA dataset, FB-RAG matches the leading baseline with over 48% latency reduction, or achieves an 8% performance improvement with a 10% latency reduction. Our analysis finds cases where, even when the forward-looking LLM fails to generate correct answers, its attempts are sufficient to guide the final model to an accurate response, demonstrating how smaller LLMs can systematically improve the performance and efficiency of larger ones.

Paper Structure

This paper contains 21 sections, 7 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Overview of FB-RAG: a training-free framework for generating answers for an input query and context. FB-RAG looks at both the input query and sampled outputs from a light-weight LLM to rank context chunks.
  • Figure 2: Top: Results on EN.QA obtained by varying the number of chunks used for final response generation. Across all data points, our approach uses a Llama3.1-8B-Instruct model for forward lookup in Stage II with $80$ context chunks as input, setting $\eta_F=1$ and $\eta_B=0$. Bottom: Performance vs. latency plot on EN.QA for the same points as in the top panel. Refer to the experiment design appendix for details on the hardware used.
  • Figure 3: Varying the model used for forward lookup in Stage II of our approach. Results are on the EN.QA dataset.
  • Figure 4: Performance comparison between our approach and OP RAG on the EN.MC dataset. Y-axis: the performance on the corresponding metric. X-axis: the number of chunks used by both methods for final response generation. Across all data points, our approach uses a Llama3.1-8B-Instruct model for forward lookup in Stage II with $80$ context chunks as input, setting $\eta_F=1$ and $\eta_B=0$.
  • Figure 5: Studying the impact on the average performance of FB-RAG on LongBench datasets by varying the number of samples used in Stage II. Model used: Ours-FB (6k $\to$ 3k).