IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues

Diji Yang, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Jie Yang, Yi Zhang

TL;DR

IM-RAG addresses the shortcomings of existing retrieval-augmented systems by enabling context-aware, multi-round retrieval through learning Inner Monologues between a Reasoner and external IR modules. It introduces a Refiner to adapt IR outputs and a Progress Tracker to provide mid-step reinforcement signals, with the Reasoner switching between querying and answering roles, guided by PPO-based RL and SFT. Evaluated on HotPotQA, IM-RAG achieves state-of-the-art results while offering improved interpretability via learned inner monologues and flexible IR integration. The approach demonstrates strong potential for robust, grounded, multi-hop QA in real-world settings, albeit with higher inference latency and data requirements for optimal progress signaling.

Abstract

Although the Retrieval-Augmented Generation (RAG) paradigms can use external knowledge to enhance and ground the outputs of Large Language Models (LLMs) to mitigate generative hallucinations and static knowledge base problems, they still suffer from limited flexibility in adopting Information Retrieval (IR) systems with varying capabilities, constrained interpretability during the multi-round retrieval process, and a lack of end-to-end optimization. To address these challenges, we propose a novel LLM-centric approach, IM-RAG, that integrates IR systems with LLMs to support multi-round RAG through learning Inner Monologues (IM, i.e., the human inner voice that narrates one's thoughts). During the IM process, the LLM serves as the core reasoning model (i.e., Reasoner) to either propose queries to collect more information via the Retriever or to provide a final answer based on the conversational context. We also introduce a Refiner that improves the outputs from the Retriever, effectively bridging the gap between the Reasoner and IR modules with varying capabilities and fostering multi-round communications. The entire IM process is optimized via Reinforcement Learning (RL) where a Progress Tracker is incorporated to provide mid-step rewards, and the answer prediction is further separately optimized via Supervised Fine-Tuning (SFT). We conduct extensive experiments with the HotPotQA dataset, a popular benchmark for retrieval-based, multi-step question-answering. The results show that our approach achieves state-of-the-art (SOTA) performance while providing high flexibility in integrating IR modules as well as strong interpretability exhibited in the learned inner monologues.
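The multi-round loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `run_inner_monologue`, `Turn`, and `Monologue`, the callable signatures, and the `max_rounds` cutoff are all assumptions introduced here. The Reasoner's two roles are modeled as two callables: a questioner that returns a query (or None once it judges the evidence sufficient) and an answerer that produces the final answer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Turn:
    """One round of the monologue: the proposed query and the refined evidence."""
    query: str
    evidence: List[str]


@dataclass
class Monologue:
    """The question plus every query/evidence round collected so far."""
    question: str
    turns: List[Turn] = field(default_factory=list)


def run_inner_monologue(
    question: str,
    questioner: Callable[[Monologue], Optional[str]],  # hypothetical Reasoner role
    answerer: Callable[[Monologue], str],              # hypothetical Reasoner role
    retriever: Callable[[str], List[str]],
    refiner: Callable[[str, List[str]], List[str]],
    max_rounds: int = 5,  # assumed safety cap, not taken from the paper
) -> Tuple[str, Monologue]:
    im = Monologue(question)
    for _ in range(max_rounds):
        query = questioner(im)
        if query is None:
            # The Reasoner judges the context sufficient and switches to answering.
            break
        docs = retriever(query)
        # The Refiner adapts raw retriever output before it reaches the Reasoner.
        im.turns.append(Turn(query, refiner(query, docs)))
    return answerer(im), im
```

Because every query and the evidence it returned are kept in `Monologue.turns`, the trace can be inspected after the fact, which is the interpretability benefit the abstract refers to.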

Paper Structure

This paper contains 35 sections, 6 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: The Inner Monologue (IM) process in IM-RAG. For a user-posed question, the Reasoner first determines if it has enough information to provide an answer. If not, it acts as a Questioner, proposing a query to request more information. The query is then directed to the Retriever, which searches for relevant documents in the knowledge source. Subsequently, the Refiner refines the retrieved documents to highlight the most pertinent information, which is then returned to the Reasoner. This iterative process may continue over multiple rounds until the Reasoner believes it has gathered enough information, at which point it becomes an Answerer and generates a final answer. This IM process provides valuable insights into the reasoning process, enabling humans to understand how the system arrived at its conclusions.
  • Figure 2: Overview of the IM-RAG framework. It involves four main components: a Reasoner, a Retriever, a Refiner, and a Progress Tracker. The Reasoner is responsible for core reasoning, switching its role between Questioner (learning to propose queries to request relevant documents via the Retriever) and Answerer (learning to predict a final answer based on the conversational context). The Refiner improves the retrieved documents via rephrasing or reranking and passes the top-k highlighted documents to both the Progress Tracker for predicting progress scores and the Reasoner for further reasoning. The Questioner is trained during the RL stage, where the progress scores are used as rewards. The Answerer is trained during the SFT stage, where the original questions, the learned IM with refined top-k documents at each turn, and the ground-truth answers are used as fine-tuning examples.
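
The two training signals named in the Figure 2 caption can be illustrated with a small data-preparation sketch. Everything here is an assumption made for illustration: the function names, the prompt layout, and the choice to score each turn from that turn's refined documents alone. The caption does not say how the Progress Tracker computes its score, so it is passed in as an opaque callable.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical representation: one IM turn = (query, refined top-k documents).
Turn = Tuple[str, List[str]]


def turn_rewards(
    turns: Sequence[Turn],
    progress_tracker: Callable[[List[str]], float],
) -> List[float]:
    """RL stage: one mid-step reward per query, scored from that turn's documents."""
    return [progress_tracker(docs) for _query, docs in turns]


def build_sft_example(
    question: str,
    turns: Sequence[Turn],
    gold_answer: str,
) -> Dict[str, str]:
    """SFT stage: question plus learned IM as the prompt, gold answer as the target."""
    lines = [f"Question: {question}"]
    for i, (query, docs) in enumerate(turns, start=1):
        lines.append(f"Query {i}: {query}")
        lines.extend(f"Evidence {i}.{j}: {doc}" for j, doc in enumerate(docs, start=1))
    lines.append("Answer:")
    return {"prompt": "\n".join(lines), "completion": gold_answer}
```

In this sketch the rewards from `turn_rewards` would feed a PPO update for the Questioner, while the dictionaries from `build_sft_example` would serve as fine-tuning pairs for the Answerer; the actual prompt template and reward definition are those of the paper, not the ones shown here.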