LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

Yujun Zhou; Jingdong Yang; Yue Huang; Kehan Guo; Zoe Emory; Bikram Ghosh; Amita Bedar; Sujay Shekar; Zhenwen Liang; Pin-Yu Chen; Tian Gao; Werner Geyer; Nuno Moniz; Nitesh V Chawla; Xiangliang Zhang

TL;DR

LabSafety Bench is a comprehensive safety benchmark for rigorously evaluating LLMs and VLMs in laboratory contexts. By assembling 765 MCQs, 404 realistic scenarios, and 3,128 open-ended tasks across biology, chemistry, and physics, the study reveals a persistent reliability gap: even top models that do well on structured questions struggle with real-world hazard identification and consequence reasoning. The authors detail the benchmark design, evaluation protocols, and targeted enhancement methods (finetuning, tool augmentation, and RAG), and release the data and code to the research community. The work highlights the urgent need for safety-focused alignment and human oversight before AI is deployed in labs, and it offers concrete directions for improving model safety awareness and reasoning in high-stakes environments.

Abstract

Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models (LLMs) and vision language models (VLMs) now assist in experiment design and procedural guidance, but their "illusion of understanding" may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment, and consequence prediction across 765 multiple-choice questions and 404 realistic lab scenarios, encompassing 3,128 open-ended tasks. Evaluations on 19 advanced LLMs and VLMs show that no model evaluated on hazard identification surpasses 70% accuracy. While proprietary models perform well on structured assessments, they do not show a clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before deploying AI systems in real laboratory settings.
