Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents

Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Ang Chen

TL;DR

Curie tackles the challenge of making AI-driven scientific experimentation rigorous by introducing an Experimental Rigor Engine composed of an intra-agent module for reliability, an inter-agent module for methodical control, and an Experiment Knowledge Module for interpretability. The architecture couples Architect and Technician agents to perform end-to-end experimentation with validation and structured knowledge management, enabling auditable and reproducible results. On a novel Experimentation Benchmark of 46 questions across four CS domains, Curie achieves a 3.4x improvement over the strongest baseline tested, underscoring the value of built-in rigor for automated science. The work lays a foundation for trustworthy, scalable AI-assisted research and points to future directions in interdisciplinary deployment and knowledge reuse.

Abstract

Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4$\times$ improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.

Paper Structure

This paper contains 34 sections, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Curie overview.
  • Figure 2: Case Study. Curie can help researchers validate, expand, and critique existing research on the benefits of repeated sampling in LLM reasoning (Brown et al., 2024). The first panel (Original Finding) presents a result from the original paper. The second panel (Reproduce) has Curie confirming this finding through rigorous experimentation. The third panel (Extend) has Curie exploring the impact of sampling temperature on repeated sampling. The final panel (Challenge) shows Curie identifying a limitation in the original methodology, suggesting an avenue for future research.
  • Figure 3: Curie workflow with an example task in LLM reasoning. The Architect designs high-level plans and reflects on new findings. The Technician implements and executes the experiments based on those plans. Whenever an agent completes an action, the Experimental Rigor Engine validates the action, determines the next steps, assigns tasks, and maintains interpretable experimental progress, ensuring rigor throughout the entire process.
  • Figure 4: Intra-ARM setup validation high-level workflow.
  • Figure 5: Errors detected by two of Intra-ARM's many validators.
  • ...and 7 more figures
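
The control flow described in the Figure 3 caption (an agent acts, then the Experimental Rigor Engine validates the action and decides what happens next) can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not Curie's implementation: the names `Action`, `RigorEngine`, `run`, and `non_empty`, the string payloads, and the retry limit are all hypothetical and do not come from the Curie codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, List


# Hypothetical types for illustration only; not taken from the Curie repository.
@dataclass
class Action:
    agent: str    # "architect" or "technician"
    payload: str  # a plan or an experiment result, reduced to a string here


@dataclass
class RigorEngine:
    validators: List[Callable[[Action], bool]]
    # A plain list stands in for the interpretable record of experimental progress.
    log: List[str] = field(default_factory=list)

    def review(self, action: Action) -> bool:
        """Run every validator on a completed action and record the outcome."""
        ok = all(check(action) for check in self.validators)
        self.log.append(f"{action.agent}: {'accepted' if ok else 'rejected'}")
        return ok


def run(question: str, engine: RigorEngine, max_rounds: int = 3) -> List[str]:
    """Alternate Architect and Technician actions, reviewing each one."""
    for round_no in range(1, max_rounds + 1):
        plan = Action("architect", f"plan {round_no} for: {question}")
        if not engine.review(plan):
            continue  # rejected plan: the Architect tries again next round
        result = Action("technician", f"result of {plan.payload}")
        if engine.review(result):
            break  # a validated result ends this toy run
    return engine.log


def non_empty(action: Action) -> bool:
    """Toy validator: an action must carry a non-blank payload."""
    return bool(action.payload.strip())


log = run("Does repeated sampling help?", RigorEngine([non_empty]))
print(log)
```

The point of the sketch is the ordering: no action from either agent reaches the next stage without passing through the engine, and every decision is recorded in one place that can be inspected afterwards.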