MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains

Kaiwen Wei, Rui Shan, Dongsheng Zou, Jianzhong Yang, Bi Zhao, Junnan Zhu, Jiang Zhong

TL;DR

MIRAGE addresses the limitations of linear, unstructured test-time reasoning in medical QA by introducing parallel multi-chain inference over structured medical knowledge graphs. It decomposes queries into entity-grounded sub-questions, performs graph-based evidence retrieval in an adaptive, think-while-search loop, and cross-verifies across chains before synthesizing a concise, provenance-rich answer. Across three medical QA benchmarks, MIRAGE consistently outperforms GPT-4o, Tree-of-Thought variants, and other retrieval-based baselines, while offering superior interpretability through explicit graph-grounded reasoning traces. The approach enhances accuracy, reliability, and auditability in high-stakes medical domains, with code to be released for reproducibility and further research.

Abstract

Large reasoning models (LRMs) have shown significant progress in test-time scaling through chain-of-thought prompting. Current approaches such as Search-o1 integrate retrieval-augmented generation (RAG) into multi-step reasoning but rely on a single, linear reasoning chain and incorporate unstructured textual information in a flat, context-agnostic manner. As a result, errors can accumulate along the reasoning chain, which significantly limits the effectiveness of these approaches in medical question-answering (QA) tasks, where both accuracy and traceability are critical. To address these challenges, we propose MIRAGE (Multi-chain Inference with Retrieval-Augmented Graph Exploration), a novel test-time scalable reasoning framework that performs dynamic multi-chain inference over structured medical knowledge graphs. Specifically, MIRAGE 1) decomposes complex queries into entity-grounded sub-questions, 2) executes parallel inference chains, 3) retrieves evidence adaptively via neighbor expansion and multi-hop traversal, and 4) integrates answers using cross-chain verification to resolve contradictions. Experiments on three medical QA benchmarks (GenMedGPT-5k, CMCQA, and ExplainCPE) show that MIRAGE consistently outperforms GPT-4o, Tree-of-Thought variants, and other retrieval-augmented baselines in both automatic and human evaluations. Additionally, MIRAGE improves interpretability by generating explicit reasoning chains that trace each factual claim to concrete chains within the knowledge graph, making it well suited to complex medical reasoning scenarios. The code will be made available for further research.
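
To make the two retrieval operations named in the abstract concrete, the following is a minimal sketch of neighbor expansion and multi-hop traversal over a toy knowledge graph. It is not the authors' implementation: the triples, the relation names, the function names (`build_adjacency`, `neighbor_expansion`, `multi_hop_path`), the `max_hops` parameter, and the choice of breadth-first search are all assumptions made for illustration.

```python
from collections import deque

# Toy (head, relation, tail) triples. Hypothetical data, not from the paper.
TRIPLES = [
    ("fever", "symptom_of", "influenza"),
    ("cough", "symptom_of", "influenza"),
    ("influenza", "treated_by", "oseltamivir"),
    ("cough", "symptom_of", "bronchitis"),
]


def build_adjacency(triples):
    """Index triples by entity, storing both directions so traversal is undirected."""
    adj = {}
    for head, rel, tail in triples:
        adj.setdefault(head, []).append((rel, tail))
        adj.setdefault(tail, []).append((f"inverse_{rel}", head))
    return adj


def neighbor_expansion(adj, entity):
    """One-hop retrieval: every (entity, relation, neighbor) triple touching `entity`."""
    return [(entity, rel, nbr) for rel, nbr in adj.get(entity, [])]


def multi_hop_path(adj, source, target, max_hops=3):
    """Breadth-first search for a shortest path of at most `max_hops` edges.

    Returns the path as a list of triples, or None if no such path exists.
    """
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        if len(path) == max_hops:
            continue  # hop budget exhausted on this branch
        for rel, nbr in adj.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                queue.append((nbr, path + [(node, rel, nbr)]))
    return None


adj = build_adjacency(TRIPLES)
one_hop = neighbor_expansion(adj, "influenza")
path = multi_hop_path(adj, "fever", "oseltamivir")
```

In this sketch, `one_hop` gathers the local evidence around a single entity, while `path` links two entities through an intermediate node (here, `fever` to `oseltamivir` via `influenza`). Each returned triple can be cited directly, which is the kind of graph-grounded provenance the abstract describes.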

Paper Structure

This paper contains 21 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of medical QA strategies: (a) Static RAG retrieves documents or knowledge graph entries without explicit reasoning; (b) Agentic RAG methods like Search-o1 integrate retrieval with linear reasoning; (c) ToT explores multiple reasoning chains via sampling; (d) The proposed MIRAGE combines these approaches by performing graph-based retrieval across parallel reasoning chains.
  • Figure 2: Overview of the proposed MIRAGE framework. Given a clinical query, the system decomposes it into sub-questions, each initiating a reasoning chain. For each chain, the system iteratively retrieves knowledge graph evidence using either Anchor mode or Bridge mode. Retrieved results are coordinated and aggregated to generate the final answer.
  • Figure 3: Effect of the decomposition threshold $N_q$ (a) and retrieval threshold $N_r$ (b) on GPT-4o Ranking and accuracy.
  • Figure 4: Human evaluation results on GenMedGPT-5k.
  • Figure 5: Case study comparing a single-chain method with the proposed multi-chain Graph RAG reasoning method.