AutoLabs: Cognitive Multi-Agent Systems with Self-Correction for Autonomous Chemical Experimentation

Gihan Panapitiya, Emily Saldanha, Heather Job, Olivia Hess

TL;DR

AutoLabs introduces a self-correcting, multi-agent framework that translates natural-language experimental goals into executable chemical protocols with hardware-ready output. Its LangGraph-based supervisor and specialized sub-agents, combined with tool-calling and two self-check regimes, enable robust design, validation, and execution guidance across five benchmark experiments. A systematic ablation identifies reasoning capacity as the key driver of quantitative accuracy, with near-expert procedural fidelity ($F1>0.89$) achieved when a multi-agent, fully reasoning configuration is used, especially with guided self-checks. The work highlights the practical value of modular agent architectures, self-correction loops, and human-in-the-loop collaboration, and outlines future directions, such as SOP integration via retrieval-based methods and memory-enabled, agent-evolving systems, to further improve reliability in autonomous laboratories.

Abstract

The automation of chemical research through self-driving laboratories (SDLs) promises to accelerate scientific discovery, yet the reliability and granular performance of the underlying AI agents remain critical, under-examined challenges. In this work, we introduce AutoLabs, a self-correcting, multi-agent architecture designed to autonomously translate natural-language instructions into executable protocols for a high-throughput liquid handler. The system engages users in dialogue, decomposes experimental goals into discrete tasks for specialized agents, performs tool-assisted stoichiometric calculations, and iteratively self-corrects its output before generating a hardware-ready file. We present a comprehensive evaluation framework featuring five benchmark experiments of increasing complexity, from simple sample preparation to multi-plate timed syntheses. Through a systematic ablation study of 20 agent configurations, we assess the impact of reasoning capacity, architectural design (single- vs. multi-agent), tool use, and self-correction mechanisms. Our results demonstrate that agent reasoning capacity is the most critical factor for success, reducing quantitative errors in chemical amounts (nRMSE) by over 85% in complex tasks. When combined with a multi-agent architecture and iterative self-correction, AutoLabs achieves near-expert procedural accuracy (F1-score > 0.89) on challenging multi-step syntheses. These findings establish a clear blueprint for developing robust and trustworthy AI partners for autonomous laboratories, highlighting the synergistic effects of modular design, advanced reasoning, and self-correction to ensure both performance and reliability in high-stakes scientific applications. Code: https://github.com/pnnl/autolabs

Paper Structure

This paper contains 20 sections, 1 equation, 12 figures.

Figures (12)

  • Figure 1: Schematic of the multi-agent AutoLabs system.
  • Figure 2: Format of an experiment step. (A) A regular chemical addition step. (B) A parameter setting step. (C) A chemical transfer step. (D) A vial transfer step: To differentiate it from a chemical transfer step, additional details are specified within brackets. For instance, the StartVialTimer instruction indicates initiating a timer to track the duration a vial remains in Plate 2. This ensures uniform timing for all vials at the transferred location prior to the next operation.
  • Figure 3: Diagram illustrating the metrics computation for an example set of generated steps from Experiment 3. TP, FP and FN stand for True Positives, False Positives and False Negatives. $X_{c=aqamm, v=A4}^{GT}$ and $X_{c=aqamm, v=A4}^{gen}$ are specific examples of the values used in the computation of the nRMSE, defined in Equation 1.
  • Figure 4: Average F1 scores (top) and nRMSE scores (bottom) for each experiment across multiple architecture configurations in the fully automated evaluation mode.
  • Figure 5: Performance Trends Across Cognitive Configurations and Experiment Complexity. (A) Average F1-score and nRMSE scores across all experiments for each architecture configuration. Color indicates the average token usage of the LLM. (B) F1 Score vs. Experiment Complexity: Line plots illustrate how step-generation accuracy (F1) changes with increasing experiment complexity (experiments 1–5). (C) nRMSE vs. Experiment Complexity: Line plots depict chemical amount error trends over experimental complexity.
  • ...and 7 more figures
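
The two metrics named in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the exact-match comparison of steps, and the dictionary layout keyed by (chemical, vial) are hypothetical choices made for illustration. The F1 formula is the standard one built from TP, FP and FN. The nRMSE normalization is an assumption (division by the mean ground-truth amount), since Equation 1 is not reproduced in this summary and the paper's definition may differ.

```python
from math import sqrt


def f1_score(generated_steps, ground_truth_steps):
    """Standard F1 from TP/FP/FN, treating each step as a hashable item.

    Assumption: steps are compared by exact match; the paper's matching
    rule may be more lenient.
    """
    gen, gt = set(generated_steps), set(ground_truth_steps)
    tp = len(gen & gt)   # steps present in both
    fp = len(gen - gt)   # generated but not in the ground truth
    fn = len(gt - gen)   # in the ground truth but not generated
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def nrmse(gen_amounts, gt_amounts):
    """RMSE over shared (chemical, vial) keys, divided by the mean GT amount.

    Assumption: normalizing by the mean ground-truth value is one common
    convention, not necessarily the one used in Equation 1 of the paper.
    """
    keys = gen_amounts.keys() & gt_amounts.keys()
    if not keys:
        raise ValueError("no shared (chemical, vial) entries to compare")
    mse = sum((gen_amounts[k] - gt_amounts[k]) ** 2 for k in keys) / len(keys)
    mean_gt = sum(gt_amounts[k] for k in keys) / len(keys)
    if mean_gt == 0:
        raise ValueError("mean ground-truth amount is zero")
    return sqrt(mse) / mean_gt
```

For example, `nrmse({("aqamm", "A4"): 110.0}, {("aqamm", "A4"): 100.0})` compares a generated amount against the ground-truth amount for one chemical in one vial, mirroring the $X_{c=aqamm, v=A4}^{gen}$ and $X_{c=aqamm, v=A4}^{GT}$ pair in the caption; the numeric amounts here are made up.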