Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning
Wenda Wei, Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Lixin Su, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Xueqi Cheng
TL;DR
Bi-RAR guides multi-step retrieval-augmented reasoning with bidirectional information distances that quantify, at each step, both progress toward the final answer and grounding in the original question. It combines these forward and backward signals in a multi-objective RL framework with cascading rewards: two specialized models are trained via GRPO, and their parameters are interpolated with a factor $\lambda \in [0,1]$ to obtain a balanced Bi-RAR model. Empirically, Bi-RAR yields substantial gains across seven QA benchmarks, outperforming the strong Search-R1 baseline while using only a quarter of its training data, and it produces shorter reasoning with fewer retrievals on average. The approach provides a principled step-level supervision mechanism for retrieval-augmented reasoning, improving the reliability and efficiency of real-time search interactions. It relies, however, on approximating Kolmogorov complexity via LM probabilities, and it incurs computational costs that warrant further optimization.
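The parameter interpolation step mentioned above can be illustrated with a minimal sketch. It assumes a plain linear blend in which $\lambda$ weights the forward-trained model and $1-\lambda$ the backward-trained one; the summary does not state which model $\lambda$ applies to, so that convention, the function name `interpolate_params`, and the dictionary-of-lists parameter format are all hypothetical.

```python
from typing import Dict, List

# Hypothetical parameter container: name -> flat list of weights.
Params = Dict[str, List[float]]


def interpolate_params(forward: Params, backward: Params, lam: float) -> Params:
    """Blend two models with identical parameter names and sizes.

    Assumed convention (not confirmed by the source):
        merged = lam * forward + (1 - lam) * backward
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if forward.keys() != backward.keys():
        raise ValueError("both models must share the same parameter names")

    merged: Params = {}
    for name, fwd in forward.items():
        bwd = backward[name]
        if len(fwd) != len(bwd):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise linear interpolation of the two weight vectors.
        merged[name] = [lam * f + (1.0 - lam) * b for f, b in zip(fwd, bwd)]
    return merged


# Toy usage with made-up weights.
forward_model = {"w": [1.0, 3.0]}
backward_model = {"w": [3.0, 1.0]}
balanced = interpolate_params(forward_model, backward_model, lam=0.5)
```

Under this assumed convention, $\lambda = 1$ recovers the forward-trained model and $\lambda = 0$ the backward-trained one, with intermediate values trading off the two objectives.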
Abstract
Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi-step reasoning scenarios. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. Most approaches rely on outcome-based supervision, offering no explicit guidance for intermediate steps. This often leads to reward hacking and degraded response quality. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity, approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi-objective reinforcement learning framework with a cascading reward structure that emphasizes early trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during training and inference.
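As a rough illustration of approximating Kolmogorov complexity via language model generation probabilities, the sketch below uses the standard code-length view: the conditional complexity of a text given a context is estimated by its negative log-likelihood under a language model, expressed in bits. The abstract does not give the exact form of the bidirectional distance, so the functions `code_length_bits` and `bidirectional_code_lengths`, and the pairing of the forward signal with the answer and the backward signal with the question, are assumptions made for illustration only. The token log-probabilities are supplied by the caller; no particular model API is assumed.

```python
import math
from typing import Sequence, Tuple


def code_length_bits(token_logprobs: Sequence[float]) -> float:
    """Estimate K(text | context) as the negative log-likelihood in bits.

    token_logprobs holds natural-log probabilities log p(token_t | context,
    earlier tokens), as a language model would assign them.
    """
    return -sum(token_logprobs) / math.log(2)


def bidirectional_code_lengths(
    answer_logprobs_given_step: Sequence[float],
    question_logprobs_given_step: Sequence[float],
) -> Tuple[float, float]:
    """Hypothetical step-level signals (not the paper's exact definition).

    forward:  bits still needed to produce the answer from the current step
    backward: bits needed to recover the question from the current step
    """
    forward = code_length_bits(answer_logprobs_given_step)
    backward = code_length_bits(question_logprobs_given_step)
    return forward, backward


# Toy usage with made-up probabilities: three answer tokens at p = 0.5 each,
# two question tokens at p = 0.25 each.
forward_bits, backward_bits = bidirectional_code_lengths(
    [math.log(0.5)] * 3,
    [math.log(0.25)] * 2,
)
```

In this reading, a smaller forward value suggests the current step is closer to the answer, and a smaller backward value suggests it remains well grounded in the question.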
