SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks

Tianhao Li; Jingyu Lu; Chuangxin Chu; Tianyu Zeng; Yujia Zheng; Mei Li; Haotian Huang; Bin Wu; Zuoxian Liu; Kai Ma; Xuejing Yuan; Xingkai Wang; Keyan Ding; Huajun Chen; Qiang Zhang

TL;DR

SciSafeEval introduces a large-scale, cross-domain benchmark to evaluate safety alignment of large language models in scientific tasks across textual, molecular, protein, and genomic languages. It combines domain-specific instruction prompts and hazard datasets with jailbreak stress tests to assess harmlessness, helpfulness, and refusal behaviors under zero-shot, few-shot, and chain-of-thought settings. The study evaluates a wide range of general-purpose and domain-specific LLMs, revealing persistent safety gaps, especially for smaller models and in adversarial scenarios, but also demonstrating notable gains with few-shot and CoT prompting. The authors argue for robust, adaptive safety mechanisms and multi-modal signals, and position SciSafeEval as a critical tool for advancing responsible AI deployment in scientific research.

Abstract

Large language models (LLMs) have a transformative impact on a variety of scientific tasks across disciplines including biology, chemistry, medicine, and physics. However, ensuring the safety alignment of these models in scientific research remains an underexplored area, with existing benchmarks primarily focusing on textual content and overlooking key scientific representations such as molecular, protein, and genomic languages. Moreover, the safety mechanisms of LLMs in scientific tasks are insufficiently studied. To address these limitations, we introduce SciSafeEval, a comprehensive benchmark designed to evaluate the safety alignment of LLMs across a range of scientific tasks. SciSafeEval spans multiple scientific languages (textual, molecular, protein, and genomic) and covers a wide range of scientific domains. We evaluate LLMs in zero-shot, few-shot, and chain-of-thought settings, and introduce a "jailbreak" enhancement feature that challenges LLMs equipped with safety guardrails, rigorously testing their defenses against malicious intent. Our benchmark surpasses existing safety datasets in both scale and scope, providing a robust platform for assessing the safety and performance of LLMs in scientific contexts. This work aims to facilitate the responsible development and deployment of LLMs, promoting alignment with safety and ethical standards in scientific research.

Paper Structure (48 sections, 8 figures, 15 tables)

WARNING: This paper contains hazardous or malicious content for red-teaming purposes.
Introduction
Related Work
LLMs for Scientific Tasks.
Risks of Misusing the LLMs for Scientific Tasks.
Safety Assessment of LLMs for Scientific Tasks.
The SciSafeEval Benchmark
Regulatory and Ethical Foundations for Scientific Safety
Benchmark Construction
Instruction Generation for Scientific Tasks
Substances From Hazard Databases
Chemistry.
Biology.
Medicine.
Physics.
...and 33 more sections

Figures (8)

Figure 1: Overview of the SciSafeEval benchmark for evaluating the safety alignment of LLMs in multiple scientific domains. The framework supports multiple science domains (Chemistry, Biology, Medicine, and Physics) and their corresponding specialized languages (textual, molecular, protein, and genomic). We consider both harmful and benign query purposes in SciSafeEval.
Figure 2: Overview of the construction process for the SciSafeEval dataset, using the Gene Sequence Generation (GSG) task in Biology as an example.
Figure 3: Harmlessness scores of the LLMs in the zero-shot, five-shot, and CoT prompting settings.
Figure 4: Heatmap of refusal rate. All numbers are percentages indicating the proportion of prompts successfully rejected by the model. Left: zero-shot, Middle: five-shot, Right: chain-of-thought (CoT). Darker shades indicate higher safety performance.
Figure 5: Trade-off between harmlessness and helpfulness for various scientific tasks for Claude-3.5 and Qwen-2.5-7B.
...and 3 more figures
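
The refusal rate in Figure 4 is described as the percentage of prompts the model rejected, reported per prompting setting. A minimal sketch of that calculation follows. It assumes a simple keyword check to decide whether a response is a refusal; the marker list, the function names, and the sample records are all hypothetical and are not taken from the paper, whose actual refusal judge is not described in this summary.

```python
from collections import defaultdict

# Hypothetical refusal markers (an assumption for illustration only;
# the benchmark's real refusal detection is not specified here).
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i won't")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the assumed refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(records):
    """Compute the refusal rate, in percent, for each prompting setting.

    records: iterable of (setting, response) pairs.
    Returns a dict mapping each setting to the percentage of refused prompts.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for setting, response in records:
        totals[setting] += 1
        if is_refusal(response):
            refused[setting] += 1
    return {s: 100.0 * refused[s] / totals[s] for s in totals}


# Made-up benign responses, used only to exercise the function.
records = [
    ("zero-shot", "Sure, here is the answer."),
    ("zero-shot", "I cannot help with that request."),
    ("five-shot", "I can't assist with this."),
    ("five-shot", "I won't provide that."),
]
rates = refusal_rates(records)
print(rates)
```

Keyword matching is only a stand-in: it misses refusals phrased differently and can misfire on compliant answers that happen to contain a marker, so a real evaluation would likely need a stronger judge.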
