To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering

Giacomo Frisoni; Alessio Cocchieri; Alex Presepi; Gianluca Moro; Zaiqiao Meng

TL;DR

MedGENIE is a fully generative framework for medical open-domain QA that grounds questions in multi-view artificial contexts generated by a medical LLM. It offers two reading pathways, an unsupervised in-context learning (ICL) reader and a supervised Fusion-In-Decoder (FID) reader, and reaches state-of-the-art open-book performance on MedQA-USMLE, MedMCQA, and MMLU-Medical with substantially fewer parameters than large closed-book models. Generated contexts ground reader models more effectively than retrieved passages, and adding them to a retrieval-augmented generation (RAG) pipeline further boosts accuracy. The work shows that synthetic grounding can reduce computational demands while maintaining or improving medical QA accuracy, though it acknowledges limitations such as potential hallucinations and the difficulty of keeping generated knowledge up to date. It invites further work on generation-based grounding and hybrid RAG strategies for evolving clinical knowledge.

Abstract

Medical open-domain question answering demands substantial access to specialized knowledge. Recent efforts have sought to decouple knowledge from model parameters, counteracting architectural scaling and allowing for training on common low-resource hardware. The retrieve-then-read paradigm has become ubiquitous, with model predictions grounded on relevant knowledge pieces from external repositories such as PubMed, textbooks, and UMLS. An alternative path, still under-explored but made possible by the advent of domain-specific large language models, entails constructing artificial contexts through prompting. As a result, "to generate or to retrieve" is the modern equivalent of Hamlet's dilemma. This paper presents MedGENIE, the first generate-then-read framework for multiple-choice question answering in medicine. We conduct extensive experiments on MedQA-USMLE, MedMCQA, and MMLU, incorporating a practical perspective by assuming a maximum of 24GB VRAM. MedGENIE sets a new state-of-the-art in the open-book setting of each testbed, allowing a small-scale reader to outcompete zero-shot closed-book 175B baselines while using up to 706$\times$ fewer parameters. Our findings reveal that generated passages are more effective than retrieved ones in attaining higher accuracy.
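As a quick sanity check on the scale claim, the reader size implied by the abstract can be worked out from its two stated figures. The sketch below assumes that "175B" means exactly 175e9 parameters and that "706$\times$ fewer" is a plain ratio between baseline and reader; the variable and function names are illustrative only.

```python
# Back-of-the-envelope check of the "up to 706x fewer parameters" claim.
# Assumptions (not stated in the source): "175B" is exactly 175e9 parameters,
# and the reduction factor is a simple baseline/reader ratio.
BASELINE_PARAMS = 175e9   # zero-shot closed-book baseline size from the abstract
REDUCTION_FACTOR = 706    # "up to 706x fewer parameters"


def implied_reader_params(baseline: float, factor: float) -> float:
    """Return the reader size implied by a baseline size and a reduction factor."""
    return baseline / factor


reader_params = implied_reader_params(BASELINE_PARAMS, REDUCTION_FACTOR)
print(f"Implied reader size: {reader_params / 1e6:.0f}M parameters")
# prints "Implied reader size: 248M parameters"
```

The result, roughly 248M parameters, is in the range of a base-sized encoder-decoder and is consistent with the Flan-T5-base reader named in the caption of Figure 1.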

Paper Structure (50 sections, 3 equations, 19 figures, 18 tables)

Introduction
Related Work
Medical Language Models
Open-Book Question Answering
Method
Problem Statement
Multi-view artificial contexts
ICL reader (Unsupervised)
Fine-tuned reader (Supervised)
Experimental Setup
Benchmarks
MedQA-USMLE [DBLP:journals/corr/abs-2009-13081]
MedMCQA [DBLP:conf/chil/PalUS22]
MMLU-Medical [DBLP:conf/iclr/HendrycksBBZMSS21]
Medical-expert generator
...and 35 more sections

Figures (19)

Figure 1: MedGENIE performance (Flan-T5-base, Fusion-In-Decoder) on USMLE-style questions. Comparison against fine-tuned open-source baselines with a maximum of 10B parameters, using the MedQA (4 options) test set. Model size displayed on a log scale.
Figure 2: Overview of the MedGENIE framework. It generates multi-view artificial contexts with a specialized LLM (top), and then uses them to ground a prompted LLM or a fine-tuned SLM (bottom).
Figure 3: Example of multi-view context generation for a MedMCQA eval instance. The knowledge verbalized by a medical LLM is highly valuable in determining the correct answer (unseen by the generator).
Figure 4: Percentage of multi-view generated contexts compared to MedWiki-retrieved contexts in the top-$K$ positions of a BGE-large reranker.
Figure 5: Word-level length distribution of PMC-LLaMA artificial contexts.
...and 14 more figures
