Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training

Artyom Sorokin, Nazar Buzun, Alexander Anokhin, Oleg Inozemcev, Egor Vedernikov, Petr Anokhin, Mikhail Burtsev, Trushkov Alexey, Yin Wenshuai, Evgeny Burnaev

TL;DR

Q-RAG tackles long-context, multi-step retrieval by training a value-based RL agent directly in the embedder latent space, avoiding costly LLM fine-tuning. The method employs two embedders with an inner-product Q-function, soft Q-learning via PQN, on-policy training with a $\lambda$-return, and a temporal reasoning mechanism based on a relative positional encoding $\rho_t(i)$ that captures dependencies across retrieved facts. It achieves state-of-the-art results on Babilong and RULER for contexts up to $10^7$ tokens and shows competitive open-domain QA performance on HotpotQA and Musique, while being significantly more compute-efficient (training on a single A100-class GPU). Because the LLM itself is not fine-tuned, the retriever can be paired with powerful proprietary LLMs, and it scales to ultra-long documents. Promising directions include using richer LLM feedback as rewards and deeper integration with generation.
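
The two training quantities named above, the soft value and the $\lambda$-return, can be illustrated with a minimal sketch. It uses the textbook definitions (soft value as a temperature-scaled log-sum-exp of Q-values, $\lambda$-return as a backward recursion) and is not taken from the paper's implementation; the function names `soft_value` and `lambda_returns`, the discount `gamma`, and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def soft_value(q_values, alpha):
    """Soft state value V(s) = alpha * log sum_a exp(Q(s, a) / alpha).

    Computed in a numerically stable way by factoring out the maximum.
    """
    q = np.asarray(q_values, dtype=float)
    m = q.max()
    return m + alpha * np.log(np.sum(np.exp((q - m) / alpha)))

def lambda_returns(rewards, next_values, gamma, lam):
    """Backward recursion for the lambda-return over one episode.

    rewards[t]     : reward received after step t
    next_values[t] : value estimate of the state reached after step t
                     (0 for a terminal state)
    G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})
    """
    T = len(rewards)
    returns = np.zeros(T)
    g = next_values[-1]  # bootstrap from the value after the final step
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        returns[t] = g
    return returns

# Toy episode (assumed numbers): reward only at the final retrieval step.
rewards = [0.0, 0.0, 1.0]
next_values = [0.5, 0.5, 0.0]
mc = lambda_returns(rewards, next_values, gamma=1.0, lam=1.0)  # Monte Carlo
td = lambda_returns(rewards, next_values, gamma=1.0, lam=0.0)  # one-step TD
```

With `lam=1.0` the recursion reduces to the Monte Carlo return and with `lam=0.0` to the one-step bootstrapped target; intermediate values interpolate between the two.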

Abstract

Retrieval-Augmented Generation (RAG) methods enhance LLM performance by efficiently filtering relevant context for LLMs, reducing hallucinations and inference cost. However, most existing RAG methods focus on single-step retrieval, which is often insufficient for answering complex questions that require multi-step search. Recently, multi-step retrieval approaches have emerged, typically involving the fine-tuning of small LLMs to perform multi-step retrieval. This type of fine-tuning is highly resource-intensive and does not enable the use of larger LLMs. In this work, we propose Q-RAG, a novel approach that fine-tunes the Embedder model for multi-step retrieval using reinforcement learning (RL). Q-RAG offers a competitive, resource-efficient alternative to existing multi-step retrieval methods for open-domain question answering and achieves state-of-the-art results on the popular long-context benchmarks Babilong and RULER for contexts up to 10M tokens.

Paper Structure

This paper contains 20 sections, 3 theorems, 37 equations, 3 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $X \subset \mathbb{R}^{d_x}$, $Y \subset \mathbb{R}^{d_y}$, and $T \subset \mathbb{R}$ be compact sets, and define the compact domain $K = X \times Y \times T$. Let $C(K, \mathbb{R})$ be the space of continuous real-valued functions on $K$ equipped with the uniform norm. Let $R_t$ be the RoPE ma Then $\mathcal{A}$ is dense in $C(K, \mathbb{R})$. That is, for any $f \in C(K, \mathbb{R})$ and $\

Figures (3)

  • Figure 1: Q-RAG agent interacts with multi-step retrieval environment. The starting state $s_0$ contains the initial query $q$. At the start of the episode, the agent embeds all chunks of the long context ${\mathbb{C}}$. At each step $t$, the agent computes a vector embedding of the current state $s_t$, which includes $q$ and all previously selected chunks. For every chunk $c^i \in {\mathbb{A}}_t$, the utility of retrieving it is evaluated by the $Q$-function $Q_\theta(s_t, a=c^i)$. The policy $\pi_\theta$ selects the next chunk from ${\mathbb{A}}_t$ with probability proportional to its $Q_\theta(s_t,c^i)$ value.
  • Figure 2: Comparison of answer accuracy on the long-context benchmark Babilong. Solid lines denote methods fine-tuned on Babilong, while dashed lines denote zero-shot methods. a) Average performance across tasks QA1–QA5. b) Performance on the hardest task, QA3, which requires the longest reasoning chain and temporal awareness.
  • Figure 3: Ablations of (a) the policy entropy coefficient $\alpha$ in the soft Q-function and (b) the $\lambda$-return parameter. (c) Inference runtime comparison. The x-axes show context length in tokens.
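
The interaction loop described in the Figure 1 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: the state embedder `toy_state_embedding` (query vector plus the mean of selected chunk embeddings) is a hypothetical stand-in for the trained state embedder, the random chunk embeddings replace a real chunk embedder, and sampling uses a softmax over Q-values with temperature `alpha`, which is one common reading of "probability proportional to" in soft Q-learning.

```python
import numpy as np

def softmax(x, temperature):
    """Numerically stable softmax with a temperature."""
    z = (x - x.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def retrieve(state_emb_fn, chunk_embs, num_steps, alpha=1.0, rng=None):
    """Select `num_steps` distinct chunks, one per step.

    state_emb_fn(selected) -> state embedding of shape (d,), built from the
    query and the chunk indices selected so far (hypothetical interface).
    Q(s_t, c_i) is the inner product of the state and chunk embeddings.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    available = list(range(len(chunk_embs)))  # action set A_t
    selected = []
    for _ in range(num_steps):
        s = state_emb_fn(selected)            # embed current state s_t
        q = chunk_embs[available] @ s         # Q-values for remaining chunks
        probs = softmax(q, alpha)             # stochastic policy over A_t
        pick = int(rng.choice(len(available), p=probs))
        selected.append(available.pop(pick))  # chunk leaves the action set
    return selected

# Toy setup (assumed): 8 chunks with 4-dimensional random embeddings.
rng = np.random.default_rng(42)
chunk_embs = rng.normal(size=(8, 4))
query_emb = rng.normal(size=4)

def toy_state_embedding(selected):
    # Hypothetical state embedder: query plus mean of selected chunks.
    if not selected:
        return query_emb
    return query_emb + chunk_embs[selected].mean(axis=0)

chosen = retrieve(toy_state_embedding, chunk_embs, num_steps=3, alpha=0.5)
```

Chunk embeddings are computed once per episode, so each step costs one state embedding plus a matrix-vector product over the remaining chunks, which is what lets this style of retrieval scale to very long contexts.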

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Theorem 2: Convergence Rate
  • proof
  • Lemma 1
  • proof