Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Qiao Jin; Yin Fang; Lauren He; Yifan Yang; Guangzhi Xiong; Zhizheng Wang; Nicholas Wan; Joey Chan; Donald C. Comeau; Robert Leaman; Charalampos S. Floudas; Aidong Zhang; Michael F. Chiang; Yifan Peng; Zhiyong Lu

TL;DR

Med-V1 is a family of small language models with only three billion parameters, built for efficient biomedical evidence attribution. It performs comparably to frontier LLMs such as GPT-5 and provides high-quality explanations for its predictions.

Abstract

Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5 and provides high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical, real-world applications in biomedical evidence attribution and verification. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.

Paper Structure (21 sections, 6 figures, 1 table)

Introduction
Results
Overview of Med-V1 Training and Inference
MedFact-Synth is Large-scale and of High-quality
Med-V1 Closes the Performance Gap between Frontier and Lightweight LLMs
Error Analysis Reveals High-quality Reasoning Despite Dataset Noise
Use Case 1: Detecting LLM Hallucinations with Med-V1
Use Case 2: Identifying High-Stakes Misattributions with Med-V1
Discussion
Methods
Constructing MedFact-Synth with synthetic data generation
Training Med-V1
Supervised fine-tuning.
Reinforcement learning.
Developing MedFact-Bench
...and 6 more sections

Figures (6)

Figure 1: Overview of Med-V1 training and inference. a: MedFact-Synth construction and Med-V1 training. Synthetic claims are generated from source papers and then verified by a panel of LLMs using relevant papers retrieved from PubMed. The resulting verified claim-evidence pairs form the MedFact-Synth dataset, which is then used to train Med-V1 through a combination of supervised fine-tuning and reinforcement learning. b: Inference with Med-V1. Given an assertion and a source biomedical article, Med-V1 assesses whether the article supports the assertion. The assertions can be derived from Boolean questions, factual claims, or citation statements, corresponding to the applications of question answering, claim verification, and citation attribution, respectively. Med-V1 outputs both a 5-point Likert rating of agreement and a natural-language explanation of its verdict.
Figure 2: Generation and evaluation of MedFact-Synth. a: An example from MedFact-Synth. b: Distribution of veracity labels in MedFact-Synth. c: Word-count distribution of the claims, rationales, and articles in MedFact-Synth. d: Confusion matrices comparing each annotation method with the true labels, i.e., the ground truth derived from annotator consensus. Annotations in MedFact-Synth (Synthetic Data) achieve higher accuracy than those of human annotators.
Figure 3: Zero-shot accuracies of different LLMs on MedFact-Bench. Performance is reported for each component dataset, and the (macro-)average accuracy is the main evaluation metric of MedFact-Bench. Frontier LLMs include large-scale open LLMs (e.g., 70B parameters) and the latest proprietary LLMs. Lightweight LLMs are 3B-parameter models. Med-V1-L3B is fine-tuned from Llama-3.2-3B-Instruct (Llama-3B), and Med-V1-Q3B is fine-tuned from Qwen2.5-3B-Instruct (Qwen-3B). Relative improvements over the respective base models are indicated. Both Med-V1 variants show significant performance improvements over their backbones. Llama-70B denotes Llama-3.3-70B-Instruct.
Figure 4: Detecting LLM Hallucination with Med-V1. a: Overview of this use case study. We use Med-V1 to analyze the hallucination rates of different LLMs and citation instructions. b: The average number of claims (citation statements) per LLM-generated answer. c: The proportions of the generated citations that can be mapped to a PubMed ID (PMID). d: The average PMID values of the citations generated by different methods, which reflect their recency. e: The hallucination rates of different methods. f: The proportion of supported claims generated by different methods. g: The number of supported claims generated by different methods. Statistics of the human-generated answers are shown as dotted horizontal lines. Avg: the averaged metric for each model. Error bars reflect 95% confidence intervals estimated from 2,000 bootstrap iterations.
Figure 5: Identifying High-stakes Misattributions with Med-V1. a: Overview of this use case study. We extract citation statements and their source articles from clinical guidelines, and automatically check their attribution validity with Med-V1. b: Distribution of the manual validation results for 50 partial contradiction and 50 strong contradiction samples. c: Topic distribution of the manually validated misattributions. d: An example of a manually validated misattribution identified by Med-V1. In this example, the claim of a "32%" reduction (from the clinical practice guideline PMC7183940) contradicts the results presented in its cited source.
...and 1 more figures
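
The captions for Figures 3 and 4 refer to two statistics: a macro-average accuracy across component datasets, and 95% confidence intervals estimated from 2,000 bootstrap iterations. The following is a minimal Python sketch of how such quantities are commonly computed. It assumes a percentile bootstrap over per-item values, since the captions do not say which bootstrap variant was used. The function names and all inputs are hypothetical and are not taken from the Med-V1 code base.

```python
import random
from statistics import mean


def macro_average_accuracy(per_dataset):
    """Unweighted mean of per-dataset accuracies.

    per_dataset maps a dataset name to a list of (prediction, label) pairs.
    Each dataset contributes equally, regardless of its size.
    """
    accuracies = []
    for pairs in per_dataset.values():
        correct = sum(1 for pred, label in pairs if pred == label)
        accuracies.append(correct / len(pairs))
    return mean(accuracies)


def bootstrap_ci(values, n_iterations=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Assumption: percentile method with resampling at the item level.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_iterations))
    lower = means[round((alpha / 2) * n_iterations)]
    upper = means[round((1 - alpha / 2) * n_iterations) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical toy predictions, not results from the paper.
    toy = {"dataset_a": [(1, 1), (0, 1)], "dataset_b": [(1, 1)]}
    print(macro_average_accuracy(toy))

    # Hypothetical per-claim indicators (1 = hallucinated, 0 = supported).
    indicators = [0, 1] * 50
    print(bootstrap_ci(indicators))
```

Macro-averaging weights every component dataset equally, so a large dataset cannot dominate the headline number. The fixed seed only makes the sketch reproducible.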
