Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training
Artyom Sorokin, Nazar Buzun, Alexander Anokhin, Oleg Inozemcev, Egor Vedernikov, Petr Anokhin, Mikhail Burtsev, Trushkov Alexey, Yin Wenshuai, Evgeny Burnaev
TL;DR
Q-RAG addresses long-context, multi-step retrieval by training a value-based RL agent directly in the embedder's latent space, avoiding costly LLM fine-tuning. The method uses two embedders whose inner product defines the Q-function, soft Q-learning via PQN, on-policy training with $\lambda$-returns, and a relative positional encoding $\rho_t(i)$ that supports temporal reasoning by capturing dependencies across retrieved facts. It achieves state-of-the-art results on Babilong and RULER for contexts up to $10^7$ tokens and competitive open-domain QA performance on HotpotQA and Musique, while being significantly more compute-efficient (training on a single A100-class GPU). In practice, the approach can be paired with powerful proprietary LLMs and scales to ultra-long documents. Promising future directions include using richer LLM feedback as rewards and deeper integration with generation.
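To make the ingredients named above more concrete, here is a minimal sketch in plain Python of an inner-product Q-function, the softmax policy and soft value commonly used in soft Q-learning, and a backward recursion for the $\lambda$-return. It is illustrative only and not the authors' implementation: the function names, the temperature parameter, and the assumption that one embedder encodes the state (query plus already retrieved chunks) while the other encodes candidate chunks are all hypothetical. PQN-specific details and the positional encoding $\rho_t(i)$ are omitted.

```python
import math

def q_values(state_emb, chunk_embs):
    """Inner-product Q-function: Q(s, a_i) = <state_emb, chunk_emb_i>.

    Assumption: state_emb comes from a state embedder and each row of
    chunk_embs from an action (chunk) embedder.
    """
    return [sum(s * c for s, c in zip(state_emb, chunk)) for chunk in chunk_embs]

def soft_policy(q, temperature=1.0):
    """Boltzmann (softmax) policy over Q-values, as in soft Q-learning."""
    m = max(q)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in q]
    total = sum(exps)
    return [e / total for e in exps]

def soft_value(q, temperature=1.0):
    """Soft state value: V(s) = tau * log sum_a exp(Q(s, a) / tau)."""
    m = max(q)
    return m + temperature * math.log(
        sum(math.exp((v - m) / temperature) for v in q)
    )

def lambda_returns(rewards, next_values, gamma=0.99, lam=0.9):
    """Lambda-return for one episode, computed backwards in time.

    G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    where next_values[t] = V(s_{t+1}) and next_values[-1] should be 0.0
    if the episode ends in a terminal state.
    """
    returns = [0.0] * len(rewards)
    g = next_values[-1]  # bootstrap from the value after the last step
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * g)
        returns[t] = g
    return returns
```

For example, `q_values([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])` scores three candidate chunks against one state embedding, and `soft_policy` turns those scores into retrieval probabilities. Setting `lam=0` in `lambda_returns` reduces the targets to one-step bootstrapped targets, while `lam=1` gives the full Monte Carlo return.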
Abstract
Retrieval-Augmented Generation (RAG) methods enhance LLM performance by efficiently filtering relevant context, reducing hallucinations and inference cost. However, most existing RAG methods focus on single-step retrieval, which is often insufficient for answering complex questions that require multi-step search. Recently, multi-step retrieval approaches have emerged, typically by fine-tuning small LLMs to perform multi-step retrieval. This type of fine-tuning is highly resource-intensive and does not enable the use of larger LLMs. In this work, we propose Q-RAG, a novel approach that fine-tunes the embedder model for multi-step retrieval using reinforcement learning (RL). Q-RAG offers a competitive, resource-efficient alternative to existing multi-step retrieval methods for open-domain question answering and achieves state-of-the-art results on the popular long-context benchmarks Babilong and RULER for contexts up to 10M tokens.
