Autonomous Laboratory Agent via Customized Domain-Specific Language Model and Modular AI Interface

Zhuo Diao, Kouma Matsumoto, Linfeng Hou, Hayato Yamashita, Masayuki Abe

TL;DR

A system architecture that addresses a fundamental challenge in deploying language-model agents for autonomous control of scientific instrumentation: ensuring reliability in safety-critical environments by separating intent interpretation, experimental planning, and command verification, providing a pathway toward scalable autonomous laboratories.

Abstract

We introduce a system architecture that addresses a fundamental challenge in deploying language-model agents for autonomous control of scientific instrumentation: ensuring reliability in safety-critical environments. The framework integrates probabilistic reasoning by domain-specialized language models with deterministic execution layers that enforce constraints through structured validation and modular orchestration. By separating intent interpretation, experimental planning, and command verification, the architecture translates high-level scientific goals into verifiable experimental actions. We demonstrate this approach in real-time atomic-resolution scanning probe microscopy experiments operated at room temperature, where the system autonomously generates control strategies, invokes corrective modules, and maintains stable operation under experimentally challenging conditions. Quantitative evaluations show that domain-adapted small language models achieve high routing robustness and command accuracy while operating on consumer-grade hardware. Beyond a specific instrument, the framework establishes a general computational principle for deploying language-model agents in safety-critical experimental workflows, providing a pathway toward scalable autonomous laboratories.
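As an illustration of the deterministic execution layer described above, the following is a minimal sketch of how a model-generated command string might be checked before it reaches an instrument. The command names (`set_bias`, `scan`), argument names, and numeric limits are hypothetical placeholders chosen for this example, not the authors' actual API or specification.

```python
import shlex
from dataclasses import dataclass

# Hypothetical command specification: names, arguments, and limits are
# illustrative assumptions, not the instrument's real API.
SPEC = {
    "set_bias": {"voltage": (-2.0, 2.0)},
    "scan": {"size_nm": (1.0, 100.0), "lines": (16, 1024)},
}


@dataclass
class Verdict:
    ok: bool
    reason: str


def validate(command_text: str) -> Verdict:
    """Accept a command only if it parses and every argument is within limits."""
    try:
        tokens = shlex.split(command_text)
    except ValueError as exc:  # e.g. an unbalanced quote in the model output
        return Verdict(False, f"format error: {exc}")
    if not tokens:
        return Verdict(False, "format error: empty command")
    name, rest = tokens[0], tokens[1:]
    if name not in SPEC:
        return Verdict(False, f"unknown command: {name}")
    limits = SPEC[name]
    if len(rest) % 2 != 0:
        return Verdict(False, "format error: expected '--name value' pairs")
    seen = set()
    for flag, raw in zip(rest[0::2], rest[1::2]):
        if not flag.startswith("--") or flag[2:] not in limits:
            return Verdict(False, f"unknown argument: {flag}")
        key = flag[2:]
        try:
            value = float(raw)
        except ValueError:
            return Verdict(False, f"argument error: {key}={raw!r} is not a number")
        low, high = limits[key]
        if not low <= value <= high:  # also rejects NaN and infinities
            return Verdict(False, f"out of specification: {key}={value}")
        seen.add(key)
    missing = sorted(set(limits) - seen)
    if missing:
        return Verdict(False, f"missing argument(s): {missing}")
    return Verdict(True, "accepted")


if __name__ == "__main__":
    for text in ("set_bias --voltage 1.5", "set_bias --voltage 5", "retract_tip"):
        print(text, "->", validate(text))
```

The point of the sketch is the separation of roles: the language model only proposes text, while a small rule-based checker decides whether that text may be executed, so an out-of-specification request is rejected regardless of how the model phrased it.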

Paper Structure

This paper contains 6 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Two-stage autonomy framework enabled by a fine-tuned, domain-specific small language model (SLM), demonstrated in real-time scanning probe microscopy (SPM) experiments. The framework visualizes the interaction among user instructions, SLM-generated outputs, and corresponding experimental results. More detailed system outputs and experimental traces are provided in Fig. S2 and Fig. S3. (a) Stage i: Direct execution of user-issued control commands generated by the SLM. (b) Rejection of invalid or out-of-specification instructions through constraint-aware validation. (c) Stage ii: Autonomous formulation and execution of multi-step experimental plans based on high-level user intent.
  • Figure 2: System architecture of the fine-tuned SLM-driven experimental automation framework, designed to enable reliable and autonomous operation of scientific instrumentation through modular orchestration and constraint-aware execution, implemented in SPM. (a) User interface enabling chat-based experiment control, real-time visualization of SPM data, and SPM knowledge-base question answering. (b) Local deployment of three SLMs, including a router SLM that interprets user inputs and assigns tasks to either a knowledge-base SLM or a command SLM. The command SLM can access the operation API and AI module integrated in a digitally enhanced SPM platform. (d) Input data routing with a dynamic adapter injection scheme.
  • Figure 4: (a) Classification accuracy of the router SLM evaluated using 4-bit-quantized Phi-4, Mistral-v0.3, and Llama-3.2 models. The Knowledge-based, Command, and Others categories correspond to labels A, B, and C in the confusion matrix, respectively, while other unexpected outputs are assigned to a None label. Normalized values and the corresponding sample counts (shown in parentheses) are summarized in the annotations. (b) Performance evaluation of Phi-4, Mistral-v0.3, and Llama-3.2 models, assessed in terms of token generation speed, GPU memory usage, perplexity, BERT F1 score, and GEval score. (c) Performance evaluation of the Command SLM in Stage i and Stage ii. The black dashed line represents the inference accuracy of OpenAI o4-mini. For Stage i, bar plots (left axis) indicate generation accuracy, while line plots (right axis) show GPU memory consumption during inference. Results demonstrate systematic performance gains enabled by domain-specific adaptation across model variants.
  • Figure 5: Distribution of error types demonstrating the effect of domain-specific fine-tuning on model reliability. (a) Error breakdown for the original, quantized, and fine-tuned Phi-4 models, showing substantial reduction of argument, instruction-following, and format errors after fine-tuning, with remaining errors primarily associated with specification awareness. (b) Comparison with OpenAI o4-mini and fine-tuned Llama-3.2, Mistral-v0.3, and Phi-4 models, indicating that domain-adapted compact models achieve reliability comparable to or better than that of a cloud-deployed LLM.
  • Figure S1: Representative input-output examples of the router SLM when classifying user inputs into three categories (Knowledge-based, Command, and Others) using Phi-4 as the base model. The fine-tuned knowledge-based SLM demonstrates long-form text generation capabilities, including structured explanations of state-of-the-art SPM concepts using domain-specific scientific terminology, academically grounded descriptions incorporating physical formulas, and the generation of SPM-specific code. In addition, the command SLM generates instrument commands that are compatible with command-line parsing. When the user input is classified as “others,” the system responds using a standard instruction-based LLM strategy. Distinct system prompts are employed for the three inference tasks, and their details are summarized in Section \ref{sec:prompt}.
  • ...and 5 more figures
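
The labeling scheme in the Figure 4(a) caption (categories mapped to A, B, and C, with unexpected outputs assigned to a None label, and each cell reporting a normalized value plus a sample count) can be sketched as follows. This is only an illustration under stated assumptions: it assumes the router emits a single letter, and the function names `parse_label` and `confusion` are hypothetical, not taken from the paper.

```python
from collections import Counter

LABELS = ("A", "B", "C")        # Knowledge-based, Command, Others
PREDICTED = LABELS + ("None",)  # "None" collects unexpected router outputs


def parse_label(raw_output: str) -> str:
    """Map raw router text to A/B/C; anything else becomes "None".

    Assumption: the router is prompted to answer with a single letter.
    """
    text = raw_output.strip().upper()
    return text if text in LABELS else "None"


def confusion(truth, raw_outputs):
    """Return {true_label: {predicted_label: (row-normalized value, count)}}."""
    counts = Counter((t, parse_label(r)) for t, r in zip(truth, raw_outputs))
    table = {}
    for t in LABELS:
        row_total = sum(counts[(t, p)] for p in PREDICTED)
        table[t] = {
            p: (counts[(t, p)] / row_total if row_total else 0.0, counts[(t, p)])
            for p in PREDICTED
        }
    return table


if __name__ == "__main__":
    truth = ["A", "A", "B", "C"]
    outputs = ["A", "b ", "B", "I think this is a command"]
    for label, row in confusion(truth, outputs).items():
        print(label, row)
```

Treating free-form or malformed router output as its own None column, rather than forcing it into one of the three categories, keeps routing failures visible in the evaluation instead of hiding them as misclassifications.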