Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery

Chaoqun Yang; Xinyu Lin; Shulin Li; Wenjie Wang; Ruihan Guo; Fuli Feng; Tat-Seng Chua

TL;DR

This work provides the first dynamic, automatic framework for assessing the new knowledge discovery capabilities of AI systems, establishing a living, evolving resource for the AI research community to catalyze progress in knowledge discovery.

Abstract

Recent advancements in Large Language Model (LLM) agents have demonstrated remarkable potential in automatic knowledge discovery. However, rigorously evaluating an AI system's capacity for knowledge discovery remains a critical challenge. Existing benchmarks predominantly rely on static datasets, which leads to inevitable data contamination: models have likely seen the evaluation knowledge during training. Furthermore, the rapid release cycles of modern LLMs quickly render static benchmarks outdated, so they fail to assess the ability to discover truly new knowledge. To address these limitations, we propose DBench-Bio, a dynamic and fully automated benchmark designed to evaluate the biological knowledge discovery ability of AI systems. DBench-Bio employs a three-stage pipeline: (1) data acquisition, which collects abstracts from rigorous, authoritative papers; (2) QA extraction, which uses LLMs to synthesize scientific hypothesis questions and corresponding discovery answers; and (3) QA filtering, which ensures quality based on relevance, clarity, and centrality. We instantiate this pipeline to construct a monthly-updated benchmark covering 12 biomedical sub-domains. Extensive evaluations of state-of-the-art (SOTA) models reveal current limitations in discovering new knowledge. Our work provides the first dynamic, automatic framework for assessing the new knowledge discovery capabilities of AI systems, establishing a living, evolving resource for the AI research community to catalyze progress in knowledge discovery.
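The three-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Abstract` and `QAPair` types, the function names, the injected `extractor` and `scorer` callables (stand-ins for LLM calls), and the per-criterion `threshold` are all hypothetical. The source only states that abstracts are collected, QA pairs are extracted by LLMs, and pairs are filtered on relevance, clarity, and centrality.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Abstract:
    title: str
    text: str
    published: date


@dataclass(frozen=True)
class QAPair:
    question: str       # scientific hypothesis question
    answer: str         # corresponding discovery answer
    source_title: str   # abstract the pair was derived from


# Hypothetical interfaces standing in for LLM calls.
Extractor = Callable[[Abstract], Optional[QAPair]]
Scorer = Callable[[QAPair], Dict[str, float]]

# The three quality criteria named in the abstract.
CRITERIA = ("relevance", "clarity", "centrality")


def acquire(abstracts: Iterable[Abstract], cutoff: date) -> List[Abstract]:
    """Stage 1: keep abstracts published strictly after the cutoff date."""
    return [a for a in abstracts if a.published > cutoff]


def extract(abstracts: Iterable[Abstract], extractor: Extractor) -> List[QAPair]:
    """Stage 2: turn each abstract into a QA pair, skipping failed extractions."""
    pairs = (extractor(a) for a in abstracts)
    return [p for p in pairs if p is not None]


def filter_pairs(pairs: Iterable[QAPair], scorer: Scorer,
                 threshold: float = 0.5) -> List[QAPair]:
    """Stage 3: keep pairs whose score meets the threshold on every criterion.

    The numeric threshold is an assumption; a missing score counts as 0.0.
    """
    kept = []
    for pair in pairs:
        scores = scorer(pair)
        if all(scores.get(c, 0.0) >= threshold for c in CRITERIA):
            kept.append(pair)
    return kept


def build_benchmark(abstracts: Iterable[Abstract], cutoff: date,
                    extractor: Extractor, scorer: Scorer,
                    threshold: float = 0.5) -> List[QAPair]:
    """Run the three stages in order and return the surviving QA pairs."""
    fresh = acquire(abstracts, cutoff)
    pairs = extract(fresh, extractor)
    return filter_pairs(pairs, scorer, threshold)
```

Passing the extractor and scorer in as callables keeps the sketch runnable without any model access: simple stub functions can replace the LLM calls for testing, and the cutoff date can be set to the release date of whichever model is being evaluated.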

Paper Structure (25 sections, 5 figures, 1 table)

Introduction
Related Work
Scientific Benchmark
Dynamic Benchmark
Automated Benchmark Construction
DBench-Bio
Data Acquisition
QA Extraction
QA Filter
Evaluation Protocol
Experiments
Experimental Setup
Evaluated Models
Implementation Details
Overall Results (RQ1)
...and 10 more sections

Figures (5)

Figure 1: The overall pipeline of DBench-Bio, which consists of three stages. (1) Data Acquisition: We source abstracts from JCR Q1 "Biology & Biochemistry" journals published post-model release to ensure rigor and prevent data leakage. (2) QA Extraction: Utilizing LLMs, we synthesize new knowledge into pairs consisting of scientific hypothesis questions and corresponding discovery answers. (3) QA Filter: We employ an LLM-based filter to remove low-quality pairs based on relevance, clarity, and centrality (see the section on quality assurance for details).
Figure 2: Overall results on DBench-Bio.
Figure 3: Results for agent-based methods on DBench-Bio.
Figure 4: Results across different domains on DBench-Bio.
Figure 5: Results for base models on MMLU-Pro (Biology) (bar chart) and DBench-Bio (January 2026) (line graph).
