FB-RAG: Improving RAG with Forward and Backward Lookup
Kushal Chawla, Alfy Samuel, Anoop Kumar, Daben Liu
TL;DR
FB-RAG addresses a bottleneck of traditional RAG, namely queries that lack strong retrieval signals, by adding a lookahead step that guides retrieval before final generation. A forward-looking component, derived from multiple outputs sampled from a lightweight LLM, is combined with a backward signal from the query to compute a forward-backward score S_FB for each context chunk. This score drives chunk selection in a three-stage pipeline (recall-focused retrieval, precision-focused retrieval, and generation). The framework is training-free, relies on off-the-shelf retrievers, and shows consistent gains across 9 datasets from LongBench and ∞Bench; on EN.QA it matches the leading baseline with over 48% lower latency. Key findings are that forward signals can improve chunk ranking even when the small LLM answers some samples incorrectly, and that pairing a lighter forward-lookup model with a stronger final generator can yield better performance-latency tradeoffs. The results position FB-RAG as a practical, scalable enhancement for long-context QA tasks, with guidance on when to prefer forward-only retrieval over forward+backward scoring and on how to balance latency and accuracy in real-world deployments.
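To make the precision-focused stage concrete, the following is a minimal sketch of how a forward-backward score might be computed and used to pick chunks. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not give the formula for S_FB, so the weighted sum, the max over sampled outputs, the weights `w_backward` and `w_forward`, and the toy token-overlap scorer (a stand-in for an off-the-shelf retriever) are all hypothetical. The sampled outputs are hard-coded strings standing in for generations from a lightweight LLM.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(chunk, text):
    """Toy relevance score: fraction of distinct tokens in `text` found in `chunk`.

    Assumption: a stand-in for an off-the-shelf retriever score.
    """
    chunk_tokens = set(tokenize(chunk))
    text_tokens = set(tokenize(text))
    if not text_tokens:
        return 0.0
    return len(chunk_tokens & text_tokens) / len(text_tokens)


def forward_backward_scores(chunks, query, forward_samples,
                            w_backward=0.5, w_forward=0.5):
    """Score each chunk against the query (backward) and sampled outputs (forward).

    Assumption: the combination (weighted sum, max over samples) is hypothetical.
    Setting w_backward=0 gives forward-only scoring.
    """
    scores = []
    for chunk in chunks:
        backward = overlap_score(chunk, query)
        forward = max((overlap_score(chunk, s) for s in forward_samples),
                      default=0.0)
        scores.append(w_backward * backward + w_forward * forward)
    return scores


def select_chunks(chunks, scores, k):
    """Keep the k highest-scoring chunks, returned in original document order."""
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in sorted(top)]


# Chunks as they might come out of the recall-focused stage.
chunks = [
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "Gustave Eiffel's company designed and built the tower.",
    "Paris is the capital of France.",
]
query = "Who built the famous tower?"
# Stand-ins for multiple outputs sampled from a lightweight LLM.
forward_samples = [
    "The tower was built by Gustave Eiffel's company.",
    "It was designed by Eiffel.",
]

scores = forward_backward_scores(chunks, query, forward_samples)
selected = select_chunks(chunks, scores, k=2)
print(selected)  # the shorter context that would be passed to the final generator
```

In this sketch the selected chunks would form the prompt for the stronger final generator, which is where the shorter, more focused context comes from. A real system would replace `overlap_score` with an actual retriever and `forward_samples` with sampled model outputs.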
Abstract
Traditional Retrieval-Augmented Generation (RAG) struggles with complex queries that lack strong signals to retrieve the most relevant context, forcing a trade-off between choosing a small context that misses key information and a large context that confuses the LLM. To address this, we propose Forward-Backward RAG (FB-RAG), a new training-free framework based on a simple yet powerful forward-looking strategy. FB-RAG employs a lightweight LLM to peek into potential future generations, using evidence from multiple sampled outputs to precisely identify the most relevant context for a final, more powerful generator. This improves performance without the complex fine-tuning or reinforcement learning common in prior work. Across $9$ datasets from LongBench and $\infty$Bench, FB-RAG consistently delivers strong results. Further, the performance gains can be achieved with reduced latency due to a shorter, more focused prompt for the powerful generator. On the EN.QA dataset, FB-RAG matches the leading baseline with over $48$% latency reduction or achieves an $8$% performance improvement with a $10$% latency reduction. Our analysis finds cases where, even when the forward-looking LLM fails to generate correct answers, its attempts are sufficient to guide the final model to an accurate response, demonstrating how smaller LLMs can systematically improve the performance and efficiency of larger ones.
