LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang

TL;DR

LabSafety Bench introduces a comprehensive safety benchmark for evaluating LLMs and VLMs in laboratory contexts. By assembling 765 MCQs and 404 realistic scenarios that encompass 3,128 open-ended tasks across biology, chemistry, and physics, the study reveals a persistent reliability gap: top models do well on structured questions but struggle with real-world hazard identification and consequence reasoning. The authors detail the benchmark design, evaluation protocols, and targeted enhancement methods (fine-tuning, tool augmentation, and RAG) and release data and code to the research community. The work highlights the urgent need for safety-focused alignment and human oversight before deploying AI in labs, and provides concrete directions to improve model safety awareness and reasoning in high-stakes environments.

Abstract

Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models (LLMs) and vision language models (VLMs) now assist in experiment design and procedural guidance, but their "illusion of understanding" may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment, and consequence prediction across 765 multiple-choice questions and 404 realistic lab scenarios, encompassing 3,128 open-ended tasks. Evaluations on 19 advanced LLMs and VLMs show that no model evaluated on hazard identification surpasses 70% accuracy. While proprietary models perform well on structured assessments, they do not show a clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before deploying AI systems in real laboratory settings.

Paper Structure

This paper contains 67 sections, 23 figures, 2 tables.

Figures (23)

  • Figure 1: Overview of LabSafety Bench. a, illustrates how undetected AI hallucinations can pose risks of laboratory incidents. b, outlines the benchmarking process used to assess these risks in AI models. c, provides simplified examples from the benchmark. d, summarizes the development pipeline. e, shows the number of benchmark questions in different forms. f, reports the performance of top-performing models in different subjects.
  • Figure 2: Model Performance on MCQs. a, model performance on text-only MCQs in LabSafety Bench in the 0-shot setting. b, accuracy (%) of the 5 top-performing models across 10 different categories for text-only MCQs in the 0-shot setting without CoT and hints. c, model performance on text-with-image questions. d, model performance on questions sourced from official university training materials versus those generated for this benchmark.
  • Figure 3: Model Performance (%) on Scenario-based Tests. a, the performance of models on the five subjects in the Hazards Identification Test. b, the performance of models on the five subjects in the Consequence Identification Test. c, the performance of models on the four tasks in the Hazards Identification Test. d, model performance on the Hazards Identification Test with varied response-point constraints. In a and b, for each subject, we computed the average score across models (shown in the last row), and for each model, we calculated the overall average score on all questions (shown in the last column). In both cases, the highest score is highlighted in bold and the second-highest score is underlined.
  • Figure 4: Simplified examples of common errors made by GPT-4o. a, an example of hallucination in the MCQ CoT answer. b, an example of a lack of comprehensiveness in the Hazards Identification Test. Blue highlights indicate key information in the question or answer that is not itself incorrect. Green marks the correct answer. Red highlights denote errors in the response, while bold red emphasizes the fundamental cause of the mistake.
  • Figure 5: Results of Different Enhancement Methods on LabSafety Bench. a, the performance heatmap of Llama-3-8B-Instruct across various fine-tuning datasets and testing dataset configurations. Each column corresponds to a distinct training dataset, and each row represents a specific testing dataset. The color intensity of each cell indicates the model's accuracy or score when trained on the column dataset and evaluated on the row dataset. The heatmap also includes the non-finetuned model and the best-performing model for each task (GPT-4o, GPT-4o-mini, and DeepSeek-R1, respectively, from the 15 models tested). b, the performance comparison of ChemCrow and baseline models across the three evaluation tasks. c, the performance comparison of baseline vs. RAG-enhanced models on the three tasks.
  • ...and 18 more figures