SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery

Jiyong Rao, Yicheng Qiu, Jiahui Zhang, Juntao Deng, Shangquan Sun, Fenghua Ling, Hao Chen, Nanqing Dong, Zhangyang Gao, Siqi Sun, Yuqiang Li, Dongzhan Zhou, Guangyu Wang, Lijun Wu, Conghui He, Xuhong Wang, Jing Shao, Xiang Liu, Yu Zhu, Mianxin Liu, Qihao Zheng, Yinghui Zhang, Jiamin Wu, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Bo Zhang, Wanli Ouyang, Runkai Zhao, Chunfeng Song, Lei Bai, Chi Zhang

TL;DR

SciDataCopilot introduces a four-agent, end-to-end framework that converts heterogeneous raw scientific data into Scientific AI-Ready data to enable autonomous, task-driven scientific discovery. It structures data through a Data Access, Intent Parsing, Data Processing, and Data Integration cascade, with case-driven knowledge bases and reusable cases guiding planning and execution. Across life science, neuroscience, and earth science use cases, it achieves substantial efficiency gains, scalable data production, and auditable provenance, illustrating a path toward AGI4S-enabled experimentation. The framework emphasizes task-conditioned data representation, constraint-driven fusion, and reproducible pipelines to bridge data heterogeneity and model-driven scientific reasoning with practical impact.

Abstract

The current landscape of AI for Science (AI4S) is predominantly anchored in large-scale textual corpora, where generative AI systems excel at hypothesis generation, literature search, and multi-modal reasoning. However, a critical bottleneck for accelerating closed-loop scientific discovery remains the utilization of raw experimental data. Characterized by extreme heterogeneity, high specificity, and deep domain expertise requirements, raw data possess neither direct semantic alignment with linguistic representations nor structural homogeneity suitable for a unified embedding space. This disconnect prevents the emerging class of Artificial General Intelligence for Science (AGI4S) from effectively interfacing with the physical reality of experimentation. In this work, we extend the text-centric AI-Ready concept to a Scientific AI-Ready data paradigm, explicitly formalizing how scientific data is specified, structured, and composed within a computational workflow. To operationalize this idea, we propose SciDataCopilot, an autonomous agentic framework designed to handle data ingestion, scientific intent parsing, and multi-modal integration in an end-to-end manner. By positioning data readiness as a core operational primitive, the framework provides a principled foundation for reusable, transferable systems, enabling the transition toward experiment-driven scientific general intelligence. Extensive evaluations across three heterogeneous scientific domains show that SciDataCopilot improves efficiency, scalability, and consistency over manual pipelines, with up to 30$\times$ speedup in data preparation.

Paper Structure (40 sections, 18 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: The Scientific AI-Ready paradigm formalizes raw data as task-conditioned, cross-modally aligned inputs for autonomous scientific discovery. The proposed SciDataCopilot instantiates this paradigm through an agentic framework for transforming heterogeneous raw data into Scientific AI-Ready data.
  • Figure 2: Architecture of SciDataCopilot. The framework integrates four collaborative agents (Data Access, Intent Parsing, Data Processing, and Data Integration) to autonomously align user intents with complex data resources. Ultimately, SciDataCopilot bridges heterogeneous scientific data with specific models, re-defining task-guided data customization and cross-disciplinary integration to empower diverse scientific research tasks.
  • Figure 3: Data Access Agent. Inputs: user query $q$ and dataset root directory $R$. Output: the scientific data knowledge base $\mathcal{K}=\{D,T,C\}$, i.e., normalized data units and descriptors that serve as the shared input to downstream intent parsing, processing, and integration.
  • Figure 4: Intent Parsing Agent. Leveraging the knowledge base $\mathcal{K}=\{D,T,C\}$, the agent processes user requirements $q$ and generates an executable plan $C_q$ through requirement analysis, case adaptation or generation, and iterative strategy review.
  • Figure 5: Data Processing Agent. Given the user query $q$, the selected raw data units, and an intent-level processing plan, the agent executes reproducible pipelines to produce processed data, visualizations, and analysis summaries for downstream integration.
  • ...and 4 more figures
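
The input/output contracts in the captions of Figures 3 to 5 can be pictured as a simple four-stage cascade. The Python sketch below is illustrative only and is not the authors' implementation: every class and function name is hypothetical, the agents are replaced by trivial placeholder logic with no LLM calls, and the three parts of $\mathcal{K}=\{D,T,C\}$ are kept as opaque lists because this summary does not define them individually. Only the flow of $q$, $R$, $\mathcal{K}$, and $C_q$ follows the captions.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class KnowledgeBase:
    """Stand-in for K = {D, T, C}. The summary does not define the parts,
    so each one is an opaque list of descriptor dictionaries (assumption)."""
    D: list[dict[str, Any]] = field(default_factory=list)
    T: list[dict[str, Any]] = field(default_factory=list)
    C: list[dict[str, Any]] = field(default_factory=list)


def data_access(q: str, R: Path) -> KnowledgeBase:
    """Hypothetical Data Access stage: query q and root R in, K out.
    Here every file under R simply becomes one normalized data unit."""
    units = [
        {"path": str(p), "suffix": p.suffix}
        for p in sorted(R.rglob("*"))
        if p.is_file()
    ]
    return KnowledgeBase(D=units)


def intent_parsing(q: str, K: KnowledgeBase) -> list[str]:
    """Hypothetical Intent Parsing stage: q and K in, plan C_q out.
    The plan is an ordered list of step strings (placeholder format)."""
    return [f"load:{unit['path']}" for unit in K.D] + ["summarize"]


def data_processing(q: str, plan: list[str]) -> dict[str, Any]:
    """Hypothetical Data Processing stage: runs the plan step by step."""
    loaded = [step.split(":", 1)[1] for step in plan if step.startswith("load:")]
    summary = f"{len(loaded)} unit(s) selected for query: {q}"
    return {"processed": loaded, "summary": summary}


def data_integration(q: str, processed: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical Data Integration stage: bundles outputs with a
    provenance trail naming the stages that produced them."""
    return {
        "query": q,
        "records": processed["processed"],
        "summary": processed["summary"],
        "provenance": ["data_access", "intent_parsing",
                       "data_processing", "data_integration"],
    }


def run_pipeline(q: str, R: Path) -> dict[str, Any]:
    """Chain the four stages in the order given in the Figure 2 caption."""
    K = data_access(q, R)
    C_q = intent_parsing(q, K)
    processed = data_processing(q, C_q)
    return data_integration(q, processed)


def demo() -> dict[str, Any]:
    """Run the cascade on a temporary directory holding two dummy files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "recording_a.csv").write_text("t,v\n0,1\n")
        (root / "recording_b.csv").write_text("t,v\n0,2\n")
        return run_pipeline("align the two recordings", root)


if __name__ == "__main__":
    print(demo()["summary"])
```

Keeping each stage a pure function of its stated inputs is one way to make the provenance trail easy to audit. In the framework itself each stage is an agent, so the placeholder bodies here only mark where that logic would sit.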