BioDisco: Multi-agent hypothesis generation with dual-mode evidence, iterative feedback and temporal evaluation

Yujing Ke, Kevin George, Kathan Pandya, David Blumenthal, Maximilian Sprang, Gerrit Großmann, Sebastian Vollmer, David Antony Selby

TL;DR

BioDisco is a modular, multi-agent framework for grounded biomedical hypothesis generation that draws jointly on biomedical knowledge graphs and live literature retrieval. Specialized agents (Background, Explorer, Scientist, Critic, Reviewer, Refiner, Planner) operate within an iterative feedback loop, and the resulting hypotheses are validated with temporal held-out evaluation, Bradley-Terry paired comparisons, and a Bayesian Rasch analysis of human ratings. Temporal evaluation and ablation studies indicate that dual-mode grounding and iterative refinement improve novelty and significance beyond generalist biomedical agents. An open-source Python package lets researchers deploy BioDisco with customizable LLMs and knowledge graphs, supporting scalable, evidence-grounded discovery while acknowledging limitations in verifiability and real-world validation.

Abstract

Identifying novel hypotheses is essential to scientific research, yet this process risks being overwhelmed by the sheer volume and complexity of available information. Existing automated methods often struggle to generate novel and evidence-grounded hypotheses, lack robust iterative refinement, and rarely undergo rigorous temporal evaluation for future discovery potential. To address this, we propose BioDisco, a multi-agent framework that draws upon language model-based reasoning and a dual-mode evidence system (biomedical knowledge graphs and automated literature retrieval) for grounded novelty, integrates an internal scoring and feedback loop for iterative refinement, and validates performance through pioneering temporal and human evaluations and a Bradley-Terry paired comparison model to provide statistically grounded assessment. Our evaluations demonstrate superior novelty and significance over ablated configurations and generalist biomedical agents. Designed for flexibility and modularity, BioDisco allows seamless integration of custom language models or knowledge graphs, and can be run with just a few lines of code.
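
The Bradley-Terry paired comparison model mentioned above assigns each competitor i a positive strength p_i such that P(i is preferred over j) = p_i / (p_i + p_j). The following is a minimal sketch of how such strengths can be fitted from pairwise preference counts. It uses the standard minorization-maximization update rather than whatever fitting procedure the paper actually uses. The function names and the toy counts are illustrative assumptions, not data or code from the paper.

```python
import math


def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1.
    """
    k = len(wins)
    strength = [1.0] * k
    for _ in range(n_iter):
        updated = []
        for i in range(k):
            total_wins = sum(wins[i][j] for j in range(k) if j != i)
            # Each pair contributes (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(k)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strength[i])
        norm = sum(updated)
        strength = [s / norm for s in updated]
    return strength


def win_probability(strength, i, j):
    """Model probability that item i is preferred over item j."""
    return strength[i] / (strength[i] + strength[j])


# Toy, made-up counts for three hypothetical systems (not results from the paper).
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strength = fit_bradley_terry(wins)
# Log-strengths are the "ability" scale on which such models are usually reported.
abilities = [math.log(s) for s in strength]
```

Only differences between abilities are identified, which is why comparisons between systems, rather than absolute values, are the meaningful quantity.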

Paper Structure

This paper contains 42 sections, 7 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: A high-level overview of our automated framework for hypothesis generation. Overseen by a planner, agents search academic literature and query a knowledge graph to obtain articles and subgraphs relevant to a user-specified research topic. A scientist agent integrates these sources to derive initial hypotheses, which are rated by a critic and then refined with additional background, discarded, or presented to the user with supporting evidence.
  • Figure 2: BioDisco architecture. Each agent has a distinct role, and some are augmented with external tools. Agents interact sequentially: the user's input is first processed by the Background agent to generate a topical summary, which guides KG extraction by the Explorer and initial hypothesis generation by the Scientist. Each hypothesis undergoes an iterative cycle of evidence retrieval and refinement. Finally, evaluations from the Critic agent are used to identify the most promising hypotheses. Here, Lit refers to the literature interface, KG to the KG interface, and BG to the generated background.
  • Figure 3: Violin plot demonstrating that hypotheses generated by BioDisco are semantically more similar to 'gold' hypotheses than gold hypotheses are to other hypotheses. The top distribution shows the pairwise similarity of unrelated gold hypotheses; the bottom shows the similarity of BioDisco-generated hypotheses to the gold standard for the same topics.
  • Figure 4: Centipede plot of ability scores for BioDisco, four ablation configurations and Biomni, with 95% comparison intervals. A multi-agent system clearly outperforms a single LLM (GPT-4.1) at generating novel, significant hypotheses; tool use (i.e. KG and literature search) and iterative refinement each yield further improvements. Biomni, a competing system, is mostly better than a single LLM and produces verifiable hypotheses, but its hypotheses are less novel and significant than those of the full BioDisco framework.
  • Figure 5: Ratings given by two independent groups of human experts to 10 hypotheses generated for the respective topics of cardiovascular disease (CVD) and immunology.
  • ...and 3 more figures
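
The control flow described in the captions for Figures 1 and 2 (generate a hypothesis, retrieve evidence, score it, refine until it is good enough) can be sketched roughly as follows. This is an illustrative skeleton only, not the BioDisco package API: every name here (`Hypothesis`, `refine_loop`, the four callables, `accept_score`, `max_rounds`) is a hypothetical stand-in, and the stub agents replace what would be LLM, literature and knowledge-graph calls. The Planner and the option to discard a hypothesis are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    score: float
    evidence: List[str]
    rounds: int  # number of refinement rounds that were applied


def refine_loop(
    topic: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    critique: Callable[[str, List[str]], float],
    refine: Callable[[str, List[str]], str],
    accept_score: float = 0.8,
    max_rounds: int = 3,
) -> Hypothesis:
    """Generate, score and iteratively refine one hypothesis (hypothetical sketch)."""
    text = generate(topic)
    evidence = retrieve(text)
    score = critique(text, evidence)
    rounds = 0
    # Refine until the critic's score clears the threshold or the budget runs out.
    while score < accept_score and rounds < max_rounds:
        text = refine(text, evidence)
        evidence = retrieve(text)        # fetch evidence again for the revised text
        score = critique(text, evidence)  # rescore so the score matches the text
        rounds += 1
    return Hypothesis(text, score, evidence, rounds)


# Stub agents standing in for LLM and tool calls (assumptions for illustration).
def stub_generate(topic: str) -> str:
    return f"{topic}: initial hypothesis"


def stub_retrieve(text: str) -> List[str]:
    return [f"evidence for '{text}'"]


def stub_critique(text: str, evidence: List[str]) -> float:
    # The toy critic rewards each refinement pass.
    return min(1.0, 0.5 + 0.2 * text.count("[refined]"))


def stub_refine(text: str, evidence: List[str]) -> str:
    return text + " [refined]"


result = refine_loop(
    "example topic", stub_generate, stub_retrieve, stub_critique, stub_refine
)
```

Passing the agents in as plain callables keeps the loop independent of any particular language model or knowledge graph, which mirrors the modularity the abstract describes.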