From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

Wenhao Wu; Zhentao Tang; Yafu Li; Shixiong Kai; Mingxuan Yuan; Zhenhong Sun; Chunlin Chen; Zhi Wang

TL;DR

MA-RAG is a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. It mirrors a *boosting* mechanism, iteratively minimizing the residual error toward a stable, high-fidelity medical **consensus**.

Abstract

Large Language Models (LLMs) exhibit high reasoning capacity in medical question answering, but their tendency to hallucinate and to rely on outdated knowledge poses critical risks in healthcare. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose **MA-RAG** (**M**ulti-Round **A**gentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic **conflict** among candidate responses into actionable queries to retrieve external evidence, while optimizing historical reasoning traces to mitigate long-context degradation. MA-RAG extends the *self-consistency* principle by treating a lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and it mirrors a *boosting* mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical **consensus**. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering a **substantial gain of +6.8 points** in average accuracy over the backbone model. Our code is available at [this url](https://github.com/NJU-RL/MA-RAG).
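The refinement loop described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `ma_rag_loop`, the callables `solve`, `retrieve`, and `score`, the query template, and the rule of stopping once all candidates agree are all hypothetical stand-ins for the Solver, Retrieval, and Ranking Agents.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical type: a candidate is an (answer, reasoning trace) pair.
Candidate = Tuple[str, str]


def ma_rag_loop(
    question: str,
    solve: Callable[[str, List[str], List[str]], Candidate],
    retrieve: Callable[[str], List[str]],
    score: Callable[[str], float],
    n_candidates: int = 4,
    max_rounds: int = 3,
    top_k: int = 2,
) -> Tuple[str, int]:
    """Return (answer, rounds used). All names here are illustrative."""
    docs: List[str] = []      # evolving document context
    history: List[str] = []   # evolving history context
    answer = ""
    for round_idx in range(max_rounds):
        # Solver step: sample several candidate responses.
        candidates = [solve(question, docs, history) for _ in range(n_candidates)]
        counts = Counter(ans for ans, _ in candidates)
        answer, votes = counts.most_common(1)[0]
        # Assumed stopping rule: full agreement counts as consensus.
        if votes == len(candidates):
            return answer, round_idx + 1
        # Retrieval step: turn the disagreement into a query (assumed template).
        query = f"{question} | conflicting answers: {', '.join(sorted(counts))}"
        docs = docs + retrieve(query)
        # Ranking step: keep only the best-scored reasoning traces as history.
        ranked = sorted(candidates, key=lambda c: score(c[1]), reverse=True)
        history = [trace for _, trace in ranked[:top_k]]
    # No consensus within the budget: fall back to the last majority answer.
    return answer, max_rounds
```

In this sketch, keeping only the `top_k` traces bounds the prompt length across rounds, which is one simple way to reflect the stated goal of mitigating long-context degradation.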

Paper Structure (33 sections, 8 equations, 7 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Method
Problem Statement
Solver Agent
Retrieval Agent
Ranking Agent
Theoretical Grounding in Classic Principles
Experiments
Main Results
Ablation Study
Test-Time Scaling Analysis
Scalability across Model Scales
Conclusions, Limitations, and Future Work
Impact Statement
...and 18 more sections

Figures (7)

Figure 1: MA-RAG’s agentic workflow that iteratively refines conflict into a unified consensus, achieving superior test-time scaling.
Figure 2: Overall pipeline of MA-RAG (Multi-round Agentic RAG) for complex medical reasoning. At each round $t$ of the agentic refinement loop: i) the Solver Agent first samples a diverse set of candidate responses; ii) the Retrieval Agent transforms semantic conflicts among candidates into actionable queries to retrieve external evidence from a local medical corpus, updating the document context to $\mathcal{D}_{t+1}$; and iii) the Ranking Agent restructures the historical reasoning traces $\mathcal{A}_t$ by prioritizing top-tier candidates to construct the history context $\mathcal{H}_{t+1}$, mitigating long-context degradation and enhancing in-context learning. The evolved state $S_t=\{\mathcal{I},q,\mathcal{D}_t,\mathcal{H}_t\}$ serves as the prompt at the next round, iteratively resolving semantic conflict and converging toward a reliable, high-fidelity consensus.
Figure 3: Performance comparison between MA-RAG and the multi-round test-time scaling baseline (Multi-Refine).
Figure 4: Visualization of the response score density on MedMCQA and MMLU-Pro. Intrinsic Uncertainty (left): Correct answers exhibit lower entropy compared to incorrect ones. Extrinsic Verification (right): The fine-tuned BERT-based evaluator assigns higher scores to correct answers, exhibiting a more pronounced discriminative margin than the entropy-based score. These distinct distributions validate the utility of both score functions for the Ranking Agent. Notably, the extrinsic evaluator exhibits superior discriminative power compared to the intrinsic counterpart, a finding consistent with the performance gap observed between MA-RAG-int and MA-RAG-ext in the main results table.
Figure 5: Performance of MA-RAG across backbone model scales.
...and 2 more figures
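The entropy-based intrinsic score mentioned in the Figure 4 caption can be illustrated with a short Python sketch. The captions do not say which probability distribution the entropy is computed over, so the helper below takes any discrete distribution as input; the function names `shannon_entropy` and `intrinsic_score`, and the choice to negate entropy so that higher means more confident, are assumptions made for illustration only.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a discrete probability distribution."""
    if not probs:
        raise ValueError("probs must be non-empty")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must sum to 1")
    # Zero-probability outcomes contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def intrinsic_score(probs: Sequence[float]) -> float:
    """Assumed convention: lower entropy maps to a higher ranking score."""
    return -shannon_entropy(probs)
```

Under this convention, a sharply peaked distribution such as `[0.99, 0.01]` scores higher than a uniform one such as `[0.5, 0.5]`, matching the caption's observation that correct answers tend to have lower entropy.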
