MuISQA: Multi-Intent Retrieval-Augmented Generation for Scientific Question Answering

Zhiyuan Li, Haisheng Yu, Guangchuan Guo, Nan Zhou, Jiajun Zhang

TL;DR

MuISQA tackles multi-intent scientific question answering by introducing a dedicated benchmark and an intent-aware retrieval framework. The method uses Hypothetical Query Generation to decompose LLM-hypothesized answers into intent-specific queries and applies Reciprocal Rank Fusion to fuse evidence from diverse passages, achieving improved coverage and accuracy across MuISQA and general RAG benchmarks. The evaluation reveals that explicit intent diversification increases information coverage ($\text{IRR}$) and reduces retrieval redundancy, with notable gains even when hypothetical generation is imperfect. The work provides a practical approach to building more reliable RAG systems for complex scientific QA and offers a fine-grained diagnostic suite for future improvements.

Abstract

Complex scientific questions often entail multiple intents, such as identifying gene mutations and linking them to related diseases. These tasks require evidence from diverse sources and multi-hop reasoning, while conventional retrieval-augmented generation (RAG) systems are usually single-intent oriented, leading to incomplete evidence coverage. To assess this limitation, we introduce the Multi-Intent Scientific Question Answering (MuISQA) benchmark, which is designed to evaluate RAG systems on heterogeneous evidence coverage across sub-questions. In addition, we propose an intent-aware retrieval framework that leverages large language models (LLMs) to hypothesize potential answers, decompose them into intent-specific queries, and retrieve supporting passages for each underlying intent. The retrieved fragments are then aggregated and re-ranked via Reciprocal Rank Fusion (RRF) to balance coverage across diverse intents while reducing redundancy. Experiments on both the MuISQA benchmark and other general RAG datasets demonstrate that our method consistently outperforms conventional approaches, particularly in retrieval accuracy and evidence coverage.
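
The fusion step mentioned in the abstract can be illustrated with a minimal sketch of standard Reciprocal Rank Fusion, in which each passage receives the score $\sum_{r} 1/(k + \text{rank}_r)$ summed over the per-intent rankings $r$ that contain it. This is not the authors' implementation: the function name, the passage ids, the example intents, and the constant `k = 60` (a commonly used default for RRF) are all assumptions made for illustration, and the paper may use a different variant or constant.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of passage ids into a single ranking.

    Each input list is assumed to come from one intent-specific query,
    ordered best-first. k is a smoothing constant (60 is a common default,
    assumed here; the source does not state the value used).
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top passage contributes 1 / (k + 1).
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (stable sort).
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)


# Hypothetical retrieval results for three intent-specific queries.
per_intent_rankings = [
    ["p1", "p2", "p3"],  # e.g. a query about gene mutations
    ["p4", "p2", "p5"],  # e.g. a query about related diseases
    ["p2", "p4", "p6"],  # e.g. a query linking the two
]

fused = reciprocal_rank_fusion(per_intent_rankings)
print(fused)
```

In this toy input, `p2` appears in all three lists and `p4` in two, so both rise above passages that rank first for only a single intent. That is the sense in which rank-based fusion favors evidence supported by several intents while still keeping top passages from each one.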

Paper Structure

This paper contains 27 sections, 10 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: An example from our MuISQA benchmark, challenging RAG systems with multi-document retrieval and multi-hop reasoning.
  • Figure 2: The construction pipeline of the MuISQA benchmark. It proceeds in four stages: (1) Topic and article collection, (2) LLM-based pre-annotation, (3) Human verification, and (4) Evaluation metrics design.
  • Figure 3: The overview of our proposed intent-aware retrieval framework. The LLM first generates hypothetical answers and decomposes them into diverse intent-specific queries. These queries are then used to retrieve relevant document chunks, which are re-ranked using the RRF algorithm to ensure comprehensive coverage of evidence.
  • Figure 4: The UMAP visualization of an example from the MuISQA dataset. Best viewed by zooming in.
  • Figure 5: Case studies on the HotpotQA and MuSiQue benchmarks. Best viewed by zooming in.
  • ...and 3 more figures