Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation

Biqing Qi; Kaiyan Zhang; Kai Tian; Haoxiang Li; Zhang-Ren Chen; Sihang Zeng; Ermo Hua; Hu Jinfang; Bowen Zhou

TL;DR

Biomedical knowledge is growing faster than researchers can turn it into hypotheses, which motivates a comprehensive evaluation of LLMs as biomedical hypothesis generators. The study constructs a background-hypothesis dataset split by publication date, evaluates zero-shot, few-shot, and fine-tuning settings, adds tool use and multi-agent interaction to explore uncertainty, and proposes four evaluation metrics. The findings show that LLMs can generate novel hypotheses on literature unseen during training, and that raising uncertainty through multi-agent collaboration increases the diversity of candidate hypotheses. Adding external knowledge through few-shot examples or tools does not always improve performance, so the type and scope of that knowledge need careful selection. Future work should expand knowledge sources and tooling.

Abstract

The rapid growth of biomedical knowledge has outpaced our ability to efficiently extract insights and generate novel hypotheses. Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction and potentially accelerate biomedical discovery. In this paper, we present a comprehensive evaluation of LLMs as biomedical hypothesis generators. We construct a dataset of background-hypothesis pairs from biomedical literature, carefully partitioned into training, seen, and unseen test sets based on publication date to mitigate data contamination. Using this dataset, we assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings. To enhance the exploration of uncertainty, a crucial aspect of scientific discovery, we incorporate tool use and multi-agent interactions in our evaluation framework. Furthermore, we propose four novel metrics grounded in extensive literature review to evaluate the quality of generated hypotheses, considering both LLM-based and human assessments. Our experiments yield two key findings: 1) LLMs can generate novel and validated hypotheses, even when tested on literature unseen during training, and 2) Increasing uncertainty through multi-agent interactions and tool use can facilitate diverse candidate generation and improve zero-shot hypothesis generation performance. However, we also observe that the integration of additional knowledge through few-shot learning and tool use may not always lead to performance gains, highlighting the need for careful consideration of the type and scope of external knowledge incorporated. These findings underscore the potential of LLMs as powerful aids in biomedical hypothesis generation and provide valuable insights to guide further research in this area.
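
The date-based partitioning described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Pair` record, the `temporal_split` function, the cutoff date, and the rule for holding out every n-th pre-cutoff pair as the "seen" test set are all assumptions made for illustration. The only idea taken from the text is that pairs published after a cutoff form the unseen test set, which limits contamination from the training data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Pair:
    """Hypothetical record for one background-hypothesis pair."""
    background: str
    hypothesis: str
    published: date


def temporal_split(pairs, cutoff, seen_test_every=10):
    """Split pairs into (train, seen_test, unseen_test) by publication date.

    Pairs published before `cutoff` go to train, except every
    `seen_test_every`-th one (in date order), which is held out as the
    seen test set. Pairs published on or after `cutoff` form the unseen
    test set. The hold-out rule is an assumption, not the paper's method.
    """
    before = sorted(
        (p for p in pairs if p.published < cutoff),
        key=lambda p: p.published,
    )
    train, seen_test = [], []
    for i, pair in enumerate(before):
        if i % seen_test_every == 0:
            seen_test.append(pair)
        else:
            train.append(pair)
    unseen_test = [p for p in pairs if p.published >= cutoff]
    return train, seen_test, unseen_test


# Toy data: 20 pairs before a hypothetical cutoff, 5 after it.
cutoff = date(2023, 1, 1)
pairs = [Pair(f"bg{i}", f"hyp{i}", date(2022, 1, 1 + i)) for i in range(20)]
pairs += [Pair(f"new_bg{i}", f"new_hyp{i}", date(2023, 6, 1 + i)) for i in range(5)]

train, seen_test, unseen_test = temporal_split(pairs, cutoff)
print(len(train), len(seen_test), len(unseen_test))
```

Splitting on a single cutoff keeps the rule easy to audit: whether a pair can appear in training depends only on its publication date.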

Paper Structure (34 sections, 1 equation, 9 figures, 14 tables)

Introduction
Preliminary
Problem Definition
Dataset Construction
Can LLMs Truly Generate Zero-Shot Hypotheses?
Experiment Setup
Experiment Results
Results of Zero-shot Setting
Results of External Knowledge
Quantitative Analysis on Uncertainty
Human Evaluation and Case Study
Can agent collaboration enhance LLMs' zero-shot generalization?
Multi-agent Framework
Experiment Results
Conclusion
...and 19 more sections

Figures (9)

Figure 1: This illustration demonstrates a generated hypothesis using the fine-tuned 65B LLaMA model within our specially constructed dataset. The generated hypothesis closely aligns with the findings in existing literature published subsequent to the training sets.
Figure 2: (a) The iterative loop of scientific discovery involves a cyclical process: observations and data from previous experiments are analyzed, leading to the generation of new hypotheses. These hypotheses then guide the design of subsequent experiments, producing fresh data to perpetuate the cycle. (b) We execute the automated data partitioning pipeline, using publication dates as the basis, to mitigate the risk of data contamination.
Figure 3: This figure displays the BLEU scores on both seen and unseen datasets.
Figure 4: This figure depicts a comparative analysis of multiple models across distinct prompting paradigms, such as zero-shot, sampled and similarity retrieval-based few-shot.
Figure 5: This figure elucidates the correlation between uncertainty and evaluation scores for all models, encompassing both zero-shot and few-shot settings, and incorporating both sampled and similarity retrieval few-shot prompts.
...and 4 more figures
