FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases

Xingjian Zhang, Sophia Moylan, Ziyang Xiong, Qiaozhu Mei, Yichen Luo, Jiaqi W. Ma

TL;DR

FlyAOC introduces an end-to-end agentic ontology curation benchmark for Drosophila genetics. AI agents must retrieve, read, and synthesize evidence from 16,898 full-text papers to produce structured GO, expression, and synonym annotations. The benchmark compares four baseline architectures (Memorization, Pipeline, Single-Agent, Multi-Agent) across paper budgets and backbone LLMs, and finds that multi-agent designs give the strongest retrieval efficiency while backbone scaling yields diminishing returns. Ground-truth annotations are derived from FlyBase, with corpus-grounded proportions of about 28.9% (GO), 37.0% (expression), and 36.4% (synonyms), which supports evaluation of both end-to-end curation and retrieval. The work identifies limitations such as incomplete corpus coverage (~24% post-2000) and the difficulty of proposing novel ontology terms. It points to future directions in retrieval strategies, intermediate representations, and interactive human-in-the-loop curation to advance retrieval-augmented scientific reasoning.

Abstract

Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.

Paper Structure (28 sections, 4 equations, 3 figures, 12 tables)

Figures (3)

  • Figure 1: Gene Ontology terms form a directed acyclic graph. The highlighted term inherits from two parents.
  • Figure 2: Corpus composition by publication year (left) and venue (right). Darker bars: papers in our corpus; lighter bars: FlyBase references without open-access full text. *Publisher aggregates combining multiple journals.
  • Figure 3: Cost-performance tradeoff across architectures. The first three panels show task-specific metrics defined in Section \ref{sec:tasks:eval}; the fourth shows the percentage of ground-truth supporting papers (those cited by FlyBase curators) that agents retrieve. Each point is one configuration; labels (1p, 2p, ..., 16p) indicate paper reading budget per gene.
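
Two ideas from the captions above can be made concrete with a small sketch: a term in a directed acyclic graph can inherit from more than one parent (Figure 1), and the fourth panel of Figure 3 reports the percentage of ground-truth supporting papers that an agent retrieves. The code below is only an illustration under stated assumptions. The function names, the toy term identifiers, and the paper IDs are hypothetical, and the paper's own metric definitions may differ in detail.

```python
from typing import Dict, Iterable, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent edges.

    `parents` maps a term to its direct parents. Because the ontology is a
    DAG rather than a tree, a term may list more than one parent.
    """
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def supporting_paper_recall(retrieved: Iterable[str],
                            supporting: Iterable[str]) -> float:
    """Percentage of ground-truth supporting papers found among retrieved papers.

    Returns 0.0 when there are no supporting papers (an assumed convention).
    """
    supporting_set = set(supporting)
    if not supporting_set:
        return 0.0
    hits = supporting_set & set(retrieved)
    return 100.0 * len(hits) / len(supporting_set)


# Toy DAG with made-up identifiers: the child term has two parents.
toy_parents: Dict[str, Set[str]] = {
    "TERM:child": {"TERM:parent_a", "TERM:parent_b"},
    "TERM:parent_a": {"TERM:root"},
    "TERM:parent_b": {"TERM:root"},
    "TERM:root": set(),
}

child_ancestors = ancestors("TERM:child", toy_parents)

# Hypothetical paper IDs: the agent read three papers, two of which are
# among the four papers cited by curators for this gene.
recall_pct = supporting_paper_recall(["p1", "p2", "p9"], ["p1", "p2", "p3", "p4"])
```

In this toy example the shared root is counted once even though it is reachable through both parents, which is the practical consequence of a DAG structure when annotations are propagated to ancestor terms.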