Overview of TREC 2025 Biomedical Generative Retrieval (BioGen) Track

Deepak Gupta; Dina Demner-Fushman; William Hersh; Steven Bedrick; Kirk Roberts

Abstract

Recent advances in large language models (LLMs) have led to significant progress across multiple biomedical tasks, including biomedical question answering, lay-language summarization of the biomedical literature, and clinical note summarization. These models have demonstrated strong capabilities in processing and synthesizing complex biomedical information and in generating fluent, human-like responses. Despite these advancements, hallucinations or confabulations remain key challenges when using LLMs in biomedical and other high-stakes domains. Inaccuracies may be particularly harmful in high-risk situations, such as answering medical questions, making clinical decisions, or appraising biomedical research. Studies evaluating LLMs' ability to ground generated statements in verifiable sources have shown that models perform significantly

Paper Structure (20 sections, 4 figures, 12 tables)

Overview
Tasks
Topics
Data
Participating Teams and Submissions
Baseline Approaches
Assessment
Task A (Grounding Answer)
Expert Evaluation
Automatic Evaluation
Task B (Reference Attribution)
Expert Evaluation
Answer quality
Citation Quality
Document relevancy
...and 5 more sections

Figures (4)

Figure 1: Sample reference answer for one of the topics of Task B of the BioGen track
Figure 3: (Task A) Comparison of the submitted runs using the expert evaluation scheme in terms of Precision (Strict and Relaxed) for the Supports class.
Figure 4: (Task A) Comparison of the submitted runs using the expert evaluation scheme in terms of SoftRecall (Strict and Relaxed) for the Supports class.
Figure 5: (Task A) Comparison of the submitted runs using the expert evaluation scheme in terms of Precision and SoftRecall for the Contradicts class. Please note that for the Contradicts class, strict and relaxed results remain the same.
