Omics Data Discovery Agents

Alexandre Hutton; Jesse G. Meyer

TL;DR

We present an agentic framework that fetches omics-related articles and transforms their unstructured information into searchable research objects. We show that agents can identify semantically similar studies, determine data compatibility, and perform cross-study comparisons, revealing consistent protein regulation patterns in liver fibrosis.

Abstract

The biomedical literature contains a vast collection of omics studies, yet most published data remain functionally inaccessible for computational reuse. Even when raw data are deposited in public repositories, the information essential for reproducing reported results is dispersed across the main text, supplementary files, and code repositories. In the rarer instances where intermediate data (e.g., protein abundance files) are made available, their location is irregular. In this article, we present an agentic framework that fetches omics-related articles and transforms the unstructured information into searchable research objects. Our system employs large language model (LLM) agents with access to tools for fetching omics studies, extracting article metadata, identifying and downloading published data, executing containerized quantification pipelines, and running analyses to address novel questions. We demonstrate automated metadata extraction from PubMed Central articles, achieving 80% precision for dataset identification from standard data repositories. Using Model Context Protocol (MCP) servers to expose containerized analysis tools, our agents were able to identify a set of relevant articles, download the associated datasets, and re-quantify the proteomics data. With preprocessing matched to the reported methods, the results had a 63% overlap in differentially expressed proteins. Furthermore, we show that agents can identify semantically similar studies, determine data compatibility, and perform cross-study comparisons, revealing consistent protein regulation patterns in liver fibrosis. This work establishes a foundation for converting the static biomedical literature into an executable, queryable resource that enables automated data reuse at scale.
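
The abstract reports two summary metrics: precision for dataset identification and overlap in differentially expressed proteins. The sketch below shows one way such metrics could be computed from sets of identifiers. The function names, the placeholder identifiers, and in particular the overlap definition (share of reported proteins recovered by re-analysis) are assumptions made for illustration; the abstract does not state which overlap definition the authors used.

```python
def precision(predicted, truth):
    """Fraction of predicted items that are correct: TP / (TP + FP)."""
    predicted, truth = set(predicted), set(truth)
    if not predicted:
        return 0.0
    return len(predicted & truth) / len(predicted)


def overlap_fraction(reported, reanalyzed):
    """Share of reported items also found by re-analysis.

    Assumption: this is only one possible overlap definition;
    a Jaccard index would be an equally plausible reading.
    """
    reported, reanalyzed = set(reported), set(reanalyzed)
    if not reported:
        return 0.0
    return len(reported & reanalyzed) / len(reported)


# Placeholder identifiers, invented purely to exercise the functions.
extracted_datasets = {"ACC-A", "ACC-B", "ACC-C", "ACC-D", "ACC-X"}
curated_datasets = {"ACC-A", "ACC-B", "ACC-C", "ACC-D", "ACC-E"}
dataset_precision = precision(extracted_datasets, curated_datasets)

reported_de = {"PROT1", "PROT2", "PROT3", "PROT4"}
reanalyzed_de = {"PROT2", "PROT3", "PROT4", "PROT5"}
de_overlap = overlap_fraction(reported_de, reanalyzed_de)
```

Precision ignores datasets that the extraction missed entirely, so a recall figure would be needed to judge completeness.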

Paper Structure (27 sections, 3 figures, 1 table)

Introduction
Methods
System Architecture Overview
Article Ingestion and Metadata Extraction
Data Sources
LLM-Based Information Extraction
Evaluation of Metadata Extraction
Agent-Guided Raw Data Reanalysis
Model Context Protocol (MCP) Server Design
Containerized Analysis Tools
Agent Workflow for Quantification
Evaluation of Quantification
Cross-Study Reasoning
Study Compatibility Assessment
Results
...and 12 more sections

Figures (3)

Figure 1: Schematic of agent system. Articles and supplemental materials are obtained by the Article Parsing agent and metadata is extracted into a database. Articles
Figure 2: Comparison of protein identifications reported by Chen et al. with those obtained by ODDA using the article content to inform parameter selection. Filtering: both datasets were filtered to remove reverse hits, contaminants, and proteins identified only by site. Proteins were matched by UniProt Majority Protein ID and by gene name. The number of proteins with valid intensity values was counted for each of the 6 samples (CCl4-1/2/3, Oil-1/2/3). For each sample, proteins present in both datasets were identified and LFQ intensity values were extracted for the shared proteins. Intensities were log10-transformed for correlation analysis, and Pearson correlation was then calculated per sample and overall (pooled across all samples).
Figure 3: Distribution of articles in a UMAP. Text embeddings were computed using article abstracts and used to compute the UMAP. The three articles that were identified as semantically similar are highlighted in red.
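
The comparison described in the Figure 2 caption (shared proteins, log10 transform, Pearson correlation per sample and pooled) can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the data layout, the function names, the rule that a "valid" intensity is a positive non-missing number, and the toy intensities are all assumptions. Filtering of reverse hits and contaminants is assumed to have happened upstream.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def is_valid(value):
    # Assumption: a valid intensity is a positive, non-missing number.
    return value is not None and value > 0


def compare_samples(reported, reanalyzed, samples):
    """Return (per-sample r, pooled r) on log10 intensities of shared proteins.

    reported / reanalyzed: {protein_id: {sample: LFQ intensity}}.
    """
    shared = sorted(reported.keys() & reanalyzed.keys())
    per_sample = {}
    pooled_x, pooled_y = [], []
    for sample in samples:
        xs, ys = [], []
        for protein in shared:
            a = reported[protein].get(sample)
            b = reanalyzed[protein].get(sample)
            if is_valid(a) and is_valid(b):
                xs.append(math.log10(a))
                ys.append(math.log10(b))
        per_sample[sample] = pearson(xs, ys)
        pooled_x.extend(xs)
        pooled_y.extend(ys)
    return per_sample, pearson(pooled_x, pooled_y)


# Toy intensities invented for illustration; the second table is the first
# scaled by 2, plus one protein that is not shared.
reported = {
    "P1": {"CCl4-1": 1e6, "Oil-1": 2e6},
    "P2": {"CCl4-1": 1e7, "Oil-1": 5e6},
    "P3": {"CCl4-1": 1e8, "Oil-1": 0},
}
reanalyzed = {
    "P1": {"CCl4-1": 2e6, "Oil-1": 4e6},
    "P2": {"CCl4-1": 2e7, "Oil-1": 1e7},
    "P3": {"CCl4-1": 2e8, "Oil-1": 0},
    "P5": {"CCl4-1": 3e6, "Oil-1": 3e6},
}
per_sample, pooled = compare_samples(reported, reanalyzed, ["CCl4-1", "Oil-1"])
```

Because a constant scaling factor becomes a constant offset after the log10 transform, the toy tables correlate almost perfectly. Real comparisons would also need to handle samples with fewer than two shared proteins, for which the correlation is undefined.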
